INTRODUCTION TO ASICs



An ASIC (pronounced a-sick; bold typeface defines a new term) is an
application-specific integrated circuit; at least, that is what the acronym stands for.
Before we answer the question of what that means, we first look at the evolution
of the silicon chip or integrated circuit (IC).

Figure 1.1(a) shows an IC package (this is a pin-grid array, or PGA, shown 
upside down; the pins will go through holes in a printed-circuit board). People 
often call the package a chip, but, as you can see in Figure 1.1(b), the silicon chip 
itself (more properly called a die ) is mounted in the cavity under the sealed lid. 

A PGA package is usually made from a ceramic material, but plastic packages 
are also common. 



FIGURE 1.1 An integrated
circuit (IC). (a) A pin-grid
array (PGA) package. (b) The
silicon die or chip is under
the package lid.




The physical size of a silicon die varies from a few millimeters on a side to over 
1 inch on a side, but instead we often measure the size of an IC by the number of 
logic gates or the number of transistors that the IC contains. As a unit of measure 
a gate equivalent corresponds to a two-input NAND gate (a circuit that performs
the logic function F = (A · B)', the complement of A AND B). Often we just use the
term gates instead of gate equivalents when we are measuring chip size, not to be
confused with the gate terminal of a transistor. For example, a 100 k-gate IC
contains the equivalent of
100,000 two-input NAND gates. 

The semiconductor industry has evolved from the first ICs of the early 1970s and 
matured rapidly since then. Early small-scale integration (SSI) ICs contained a
few (1 to 10) logic gates (NAND gates, NOR gates, and so on), amounting to a few
tens of transistors. The era of medium-scale integration (MSI) increased the
range of integrated logic available to counters and similar, larger scale, logic
functions. The era of large-scale integration (LSI) packed even larger logic
functions, such as the first microprocessors, into a single chip. The era of very
large-scale integration (VLSI) now offers 64-bit microprocessors, complete with
cache memory and floating-point arithmetic units: well over a million transistors
on a single piece of silicon. As CMOS process technology improves, transistors
continue to get smaller and ICs hold more and more transistors. Some people
(especially in Japan) use the term ultralarge scale integration (ULSI), but most
people stop at the term VLSI; otherwise we have to start inventing new words.

The earliest ICs used bipolar technology and the majority of logic ICs used either
transistor-transistor logic (TTL) or emitter-coupled logic (ECL). Although
invented before the bipolar transistor, the metal-oxide-silicon (MOS) transistor
was initially difficult to manufacture because of problems with the oxide
interface. As these problems were gradually solved, metal-gate n-channel MOS
(nMOS or NMOS) technology developed in the 1970s. At that time MOS
technology required fewer masking steps, was denser, and consumed less power 
than equivalent bipolar ICs. This meant that, for a given performance, an MOS 
IC was cheaper than a bipolar IC and led to investment and growth of the MOS 
IC market. 

By the early 1980s the aluminum gates of the transistors were replaced by
polysilicon gates, but the name MOS remained. The introduction of polysilicon
as a gate material was a major improvement in CMOS technology, making it
easier to make two types of transistors, n-channel MOS and p-channel MOS
transistors, on the same IC: a complementary MOS (CMOS, never cMOS)
technology. The principal advantage of CMOS over NMOS is lower power
consumption. Another advantage of a polysilicon gate was a simplification of the 
fabrication process, allowing devices to be scaled down in size. 

There are four CMOS transistors in a two-input NAND gate (and a two-input 
NOR gate too), so to convert between gates and transistors, you multiply the 
number of gates by 4 to obtain the number of transistors. We can also measure an 
IC by the smallest feature size (roughly half the length of the smallest transistor) 
imprinted on the IC. Transistor dimensions are measured in microns (a micron, 1
µm, is a millionth of a meter). Thus we talk about a 0.5 µm IC or say an IC is
built in (or with) a 0.5 µm process, meaning that the smallest transistors are 0.5
µm in length. We give a special label, λ or lambda, to this smallest feature size.
Since lambda is equal to half of the smallest transistor length, λ = 0.25 µm in a
0.5 µm process. Many of the drawings in this book use a scale marked with
lambda for the same reason we place a scale on a map. 
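
The size conventions above reduce to simple arithmetic, which we can sketch in a few lines of Python. The function names are illustrative and not part of any tool; the factor of four transistors per gate and the relation that lambda is half the smallest transistor length are taken from the text.

```python
# Rough size conversions for a CMOS IC, following the conventions in the text.
TRANSISTORS_PER_GATE = 4  # a two-input CMOS NAND (or NOR) gate uses four transistors


def nand2(a: int, b: int) -> int:
    """Two-input NAND: the reference function for one gate equivalent."""
    return 0 if (a and b) else 1


def gates_to_transistors(gates: int) -> int:
    """Convert a gate-equivalent count to an approximate transistor count."""
    return gates * TRANSISTORS_PER_GATE


def lambda_um(min_transistor_length_um: float) -> float:
    """Lambda is half of the smallest transistor length (both in microns)."""
    return min_transistor_length_um / 2.0


print(gates_to_transistors(100_000))  # 400000 transistors for a 100 k-gate IC
print(lambda_um(0.5))                 # 0.25 (microns) in a 0.5 micron process
```

These figures are estimates only: a real IC mixes gates of many sizes, so the factor of four is a rule of thumb rather than an exact count.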

A modern submicron CMOS process is now just as complicated as a submicron 
bipolar or BiCMOS (a combination of bipolar and CMOS) process. However, 
CMOS ICs have established a dominant position, are manufactured in much 
greater volume than any other technology, and therefore, because of the economy 
of scale, the cost of CMOS ICs is less than a bipolar or BiCMOS IC for the same 
function. Bipolar and BiCMOS ICs are still used for special needs. For example, 
bipolar technology is generally capable of handling higher voltages than CMOS. 
This makes bipolar and BiCMOS ICs useful in power electronics, cars, telephone
circuits, and so on.



Some digital logic ICs and their analog counterparts (analog/digital converters,
for example) are standard parts, or standard ICs. You can select standard ICs
from catalogs and data books and buy them from distributors. Systems 
manufacturers and designers can use the same standard part in a variety of 
different microelectronic systems (systems that use microelectronics or ICs). 

With the advent of VLSI in the 1980s engineers began to realize the advantages 
of designing an IC that was customized or tailored to a particular system or 
application rather than using standard ICs alone. Microelectronic system design 
then becomes a matter of defining the functions that you can implement using 
standard ICs and then implementing the remaining logic functions (sometimes
called glue logic) with one or more custom ICs. As VLSI became possible you
could build a system from a smaller number of components by combining many 
standard ICs into a few custom ICs. Building a microelectronic system with 
fewer ICs allows you to reduce cost and improve reliability. 

Of course, there are many situations in which it is not appropriate to use a custom
IC for each and every part of a microelectronic system. If you need a large
amount of memory, for example, it is still best to use standard memory ICs,
either dynamic random-access memory (DRAM or dRAM) or static RAM
(SRAM or sRAM), in conjunction with custom ICs.

One of the first conferences to be devoted to this rapidly emerging segment of the 
IC industry was the IEEE Custom Integrated Circuits Conference (CICC), and 
the proceedings of this annual conference form a useful reference to the 
development of custom ICs. As different types of custom ICs began to evolve for 
different types of applications, these new ICs gave rise to a new term: 
application-specific IC, or ASIC. Now we have the IEEE International ASIC
Conference, which tracks advances in ASICs separately from other types of
custom ICs. Although the exact definition of an ASIC is difficult, we shall look at 
some examples to help clarify what people in the IC industry understand by the 
term. 

Examples of ICs that are not ASICs include standard parts such as: memory chips
sold as a commodity item (ROMs, DRAM, and SRAM); microprocessors; and TTL or
TTL-equivalent ICs at SSI, MSI, and LSI levels.

Examples of ICs that are ASICs include: a chip for a toy bear that talks; a chip 
for a satellite; a chip designed to handle the interface between memory and a 
microprocessor for a workstation CPU; and a chip containing a microprocessor as 
a cell together with other logic. 

As a general rule, if you can find it in a data book, then it is probably not an 
ASIC, but there are some exceptions. For example, two ICs that might or might 
not be considered ASICs are a controller chip for a PC and a chip for a modem. 
Both of these examples are specific to an application (shades of an ASIC) but are 
sold to many different system vendors (shades of a standard part). ASICs such as
these are sometimes called application-specific standard products (ASSPs).



Trying to decide which members of the huge IC family are application-specific is
tricky; after all, every IC has an application. For example, people do not usually
consider an application-specific microprocessor to be an ASIC. I shall describe 
how to design an ASIC that may include large cells such as microprocessors, but 
I shall not describe the design of the microprocessors themselves. Defining an 
ASIC by looking at the application can be confusing, so we shall look at a 
different way to categorize the IC family. The easiest way to recognize people is 
by their faces and physical characteristics: tall, short, thin. The easiest 
characteristics of ASICs to understand are physical ones too, and we shall look at 
these next. It is important to understand these differences because they affect 
such factors as the price of an ASIC and the way you design an ASIC. 




1.1 Types of ASICs 

ICs are made on a thin (a few hundred microns thick), circular silicon wafer,
with each wafer holding hundreds of die (sometimes people use dies or dice for
the plural of die). The transistors and wiring are made from many layers (usually
between 10 and 15 distinct layers) built on top of one another. Each successive
mask layer has a pattern that is defined using a mask similar to a glass
photographic slide. The first half-dozen or so layers define the transistors. The
last half-dozen or so layers define the metal wires between the transistors (the
interconnect).

A full-custom IC includes some (possibly all) logic cells that are customized and 
all mask layers that are customized. A microprocessor is an example of a
full-custom IC: designers spend many hours squeezing the most out of every last
square micron of microprocessor chip space by hand. Customizing all of the IC
features in this way allows designers to include analog circuits, optimized
memory cells, or mechanical structures on an IC, for example. Full-custom ICs
are the most expensive to manufacture and to design. The manufacturing lead
time (the time it takes just to make an IC, not including design time) is typically
eight weeks for a full-custom IC. These specialized full-custom ICs are often
intended for a specific application, so we might call some of them full-custom 
ASICs. 

We shall discuss full-custom ASICs briefly next, but the members of the IC
family that we are more interested in are semicustom ASICs, for which all of the
logic cells are predesigned and some (possibly all) of the mask layers are
customized. Using predesigned cells from a cell library makes our lives as
designers much, much easier. There are two types of semicustom ASICs that we
shall cover: standard-cell-based ASICs and gate-array-based ASICs. Following
this we shall describe the programmable ASICs, for which all of the logic cells
are predesigned and none of the mask layers are customized. There are two types
of programmable ASICs: the programmable logic device and, the newest member
of the ASIC family, the field-programmable gate array.
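
The three families just described differ in only two respects: whether any logic cells are custom designed, and how many mask layers are customized. A minimal Python sketch of that classification follows. The function name and the string labels are assumptions made for illustration; the rules themselves come from the definitions above.

```python
def classify_asic(has_custom_cells: bool, customized_mask_layers: str) -> str:
    """Classify an ASIC by the two physical criteria used in the text.

    has_custom_cells: True if some (possibly all) logic cells are custom designed.
    customized_mask_layers: "none", "some", or "all".
    """
    if customized_mask_layers not in {"none", "some", "all"}:
        raise ValueError("customized_mask_layers must be 'none', 'some', or 'all'")
    if has_custom_cells:
        # The text defines full-custom as custom cells plus all mask layers customized.
        if customized_mask_layers == "all":
            return "full-custom"
        raise ValueError("custom logic cells imply that all mask layers are customized")
    if customized_mask_layers == "none":
        return "programmable"  # PLDs and FPGAs
    return "semicustom"        # standard-cell-based or gate-array-based


print(classify_asic(True, "all"))    # full-custom
print(classify_asic(False, "all"))   # semicustom (for example, a standard-cell ASIC)
print(classify_asic(False, "some"))  # semicustom (for example, a gate array)
print(classify_asic(False, "none"))  # programmable
```

The combination of custom cells with fewer than all mask layers customized is not defined in the text, so the sketch rejects it rather than guessing.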

1.1.1 Full-Custom ASICs 



In a full-custom ASIC an engineer designs some or all of the logic cells, circuits, 
or layout specifically for one ASIC. This means the designer abandons the 
approach of using pretested and precharacterized cells for all or part of that 
design. It makes sense to take this approach only if there are no suitable existing
cell libraries available that can be used for the entire design. This might be
because existing cell libraries are not fast enough, or the logic cells are not small 
enough or consume too much power. You may need to use full-custom design if 
the ASIC technology is new or so specialized that there are no existing cell 
libraries or because the ASIC is so specialized that some circuits must be custom 
designed. Fewer and fewer full-custom ICs are being designed because of the 
problems with these special parts of the ASIC. There is one growing member of 
this family, though, the mixed analog/digital ASIC, which we shall discuss next. 

Bipolar technology has historically been used for precision analog functions. 
There are some fundamental reasons for this. In all integrated circuits the 
matching of component characteristics between chips is very poor, while the 
matching of characteristics between components on the same chip is excellent. 
Suppose we have transistors T1, T2, and T3 on an analog/digital ASIC. The three
transistors are all the same size and are constructed in an identical fashion.
Transistors T1 and T2 are located adjacent to each other and have the same
orientation. Transistor T3 is the same size as T1 and T2 but is located on the
other side of the chip from T1 and T2 and has a different orientation. ICs are
made in batches called wafer lots. A wafer lot is a group of silicon wafers that are 
all processed together. Usually there are between 5 and 30 wafers in a lot. Each 
wafer can contain tens or hundreds of chips depending on the size of the IC and 
the wafer. 

If we were to make measurements of the characteristics of transistors T1, T2, and
T3 we would find the following:

• Transistor T1 will have virtually identical characteristics to T2 on the
same IC. We say that the transistors match well or the tracking between
devices is excellent.

• Transistor T3 will match transistors T1 and T2 on the same IC very well,
but not as closely as T1 matches T2 on the same IC.

• Transistors T1, T2, and T3 will match fairly well with transistors T1, T2,
and T3 on a different IC on the same wafer. The matching will depend on
how far apart the two ICs are on the wafer.

• Transistors on ICs from different wafers in the same wafer lot will not 
match very well. 

• Transistors on ICs from different wafer lots will match very poorly. 

For many analog designs the close matching of transistors is crucial to circuit 
operation. For these circuit designs pairs of transistors are used, located adjacent 
to each other. Device physics dictates that a pair of bipolar transistors will always 
match more precisely than CMOS transistors of a comparable size. Bipolar 
technology has historically been more widely used for full-custom analog design 
because of its improved precision. Despite its poorer analog properties, the use of 
CMOS technology for analog functions is increasing. There are two reasons for 
this. The first reason is that CMOS is now by far the most widely available IC 
technology. Many more CMOS ASICs and CMOS standard products are now
being manufactured than bipolar ICs. The second reason is that increased levels
of integration require mixing analog and digital functions on the same IC: this 
has forced designers to find ways to use CMOS technology to implement analog 
functions. Circuit designers, using clever new techniques, have been very 
successful in finding new ways to design analog CMOS circuits that can 
approach the accuracy of bipolar analog designs. 

1.1.2 Standard-Cell-Based ASICs



A cell-based ASIC (cell-based IC, or CBIC, a common term in Japan,
pronounced sea-bick) uses predesigned logic cells (AND gates, OR gates,
multiplexers, and flip-flops, for example) known as standard cells. We could
apply the term CBIC to any IC that uses cells, but it is generally accepted that a
cell-based ASIC or CBIC means a standard-cell-based ASIC.

The standard-cell areas (also called flexible blocks) in a CBIC are built of rows
of standard cells, like a wall built of bricks. The standard-cell areas may be used
in combination with larger predesigned cells, perhaps microcontrollers or even
microprocessors, known as megacells. Megacells are also called megafunctions,
full-custom blocks, system-level macros (SLMs), fixed blocks, cores, or 
Functional Standard Blocks (FSBs). 

The ASIC designer defines only the placement of the standard cells and the 
interconnect in a CBIC. However, the standard cells can be placed anywhere on 
the silicon; this means that all the mask layers of a CBIC are customized and are 
unique to a particular customer. The advantage of CBICs is that designers save
time and money and reduce risk by using a predesigned, pretested, and
precharacterized standard-cell library. In addition each standard cell can be
optimized individually. During the design of the cell library each and every 
transistor in every standard cell can be chosen to maximize speed or minimize 
area, for example. The disadvantages are the time or expense of designing or 
buying the standard-cell library and the time needed to fabricate all layers of the 
ASIC for each new design. 

Figure 1.2 shows a CBIC (looking down on the die shown in Figure 1.1b, for 
example). The important features of this type of ASIC are as follows: 

• All mask layers are customized: transistors and interconnect.

• Custom blocks can be embedded. 

• Manufacturing lead time is about eight weeks. 




FIGURE 1.2 A cell-based ASIC 
(CBIC) die with a single 
standard-cell area (a flexible 
block) together with four fixed 
blocks. The flexible block 
contains rows of standard cells. 

This is what you might see 
through a low-powered 
microscope looking down on the 
die of Figure 1.1(b). The small 
squares around the edge of the die 
are bonding pads that are 
connected to the pins of the ASIC 
package. 

Each standard cell in the library is constructed using full-custom design methods, 
but you can use these predesigned and precharacterized circuits without having to 
do any full-custom design yourself. This design style gives you the same
performance and flexibility advantages as a full-custom ASIC but reduces design
time and reduces risk.

Standard cells are designed to fit together like bricks in a wall. Figure 1.3 shows 
an example of a simple standard cell (it is simple in the sense it is not maximized 
for density but ideal for showing you its internal construction). Power and ground 
buses (VDD and GND or VSS) run horizontally on metal lines inside the cells. 

FIGURE 1.3 Looking down on the layout of a standard cell. This cell would be
approximately 25 microns wide on an ASIC with λ (lambda) = 0.25 microns (a
micron is 10^-6 m). Standard cells are stacked like bricks in a wall; the abutment
box (AB) defines the edges of the brick. The difference between the bounding
box (BB) and the AB is the area of overlap between the bricks. Power supplies
(labeled VDD and GND) run horizontally inside a standard cell on a metal layer
that lies above the transistor layers. Each different shaded and labeled pattern
represents a different layer. This standard cell has center connectors (the three
squares, labeled A1, B1, and Z) that allow the cell to connect to others. The
layout was drawn using ROSE, a symbolic layout editor developed by Rockwell
and Compass, and then imported into Tanner Research's L-Edit.

Standard-cell design allows the automation of the process of assembling an 
ASIC. Groups of standard cells fit horizontally together to form rows. The rows 
stack vertically to form flexible rectangular blocks (which you can reshape 
during design). You may then connect a flexible block built from several rows of 
standard cells to other standard-cell blocks or other full-custom logic blocks. For 
example, you might want to include a custom interface to a standard, predesigned 
microcontroller together with some memory. The microcontroller block may be a 
fixed-size megacell, you might generate the memory using a memory compiler, 
and the custom logic and memory controller will be built from flexible 
standard-cell blocks, shaped to fit in the empty spaces on the chip. 

Both cell-based and gate-array ASICs use predefined cells, but there is a
difference: we can change the transistor sizes in a standard cell to optimize speed
and performance, but the device sizes in a gate array are fixed. This results in a 
trade-off in performance and area in a gate array at the silicon level. The trade-off 
between area and performance is made at the library level for a standard-cell 
ASIC. 

Modern CMOS ASICs use two, three, or more levels (or layers) of metal for 
interconnect. This allows wires to cross over different layers in the same way that 
we use copper traces on different layers on a printed-circuit board. In a two-level 
metal CMOS technology, connections to the standard-cell inputs and outputs are 
usually made using the second level of metal (metal2, the upper level of metal)
at the tops and bottoms of the cells. In a three-level metal technology, 
connections may be internal to the logic cell (as they are in Figure 1.3). This 
allows for more sophisticated routing programs to take advantage of the extra 
metal layer to route interconnect over the top of the logic cells. We shall cover 
the details of routing ASICs in Chapter 17. 

A connection that needs to cross over a row of standard cells uses a feedthrough.
The term feedthrough can refer either to the piece of metal that is used to pass a
signal through a cell or to a space in a cell waiting to be used as a feedthrough,
which is very confusing. Figure 1.4 shows two feedthroughs: one in cell A.14 and
one in cell A.23.




In both two-level and three-level metal technology, the power buses (VDD and 
GND) inside the standard cells normally use the lowest (closest to the transistors) 
layer of metal (metal1). The width of each row of standard cells is adjusted so
that they may be aligned using spacer cells. The power buses, or rails, are then
connected to additional vertical power rails using row-end cells at the aligned 
ends of each standard-cell block. If the rows of standard cells are long, then 
vertical power rails can also be run in metal2 through the cell rows using special 
power cells that just connect to VDD and GND. Usually the designer manually 
controls the number and width of the vertical power rails connected to the 
standard-cell blocks during physical design. A diagram of the power distribution 
scheme for a CBIC is shown in Figure 1.4. 
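
To make the idea of rows aligned with spacer cells concrete, here is a toy Python sketch. It places cells left to right in rows of a fixed width and pads each row with a spacer so that all row ends line up. The greedy strategy, the function names, and the cell widths are assumptions made for illustration; real placement tools are far more sophisticated.

```python
def pack_rows(cell_widths, row_width):
    """Greedily place cells left to right, starting a new row when one is full.

    Returns a list of rows; each row is a list of (kind, width) tuples,
    where kind is "cell" or "spacer".
    """
    rows, current, used = [], [], 0
    for width in cell_widths:
        if width > row_width:
            raise ValueError("cell is wider than the row")
        if used + width > row_width:
            rows.append(_pad(current, used, row_width))
            current, used = [], 0
        current.append(("cell", width))
        used += width
    if current:
        rows.append(_pad(current, used, row_width))
    return rows


def _pad(row, used, row_width):
    """Append a spacer so the row is exactly row_width wide."""
    gap = row_width - used
    return row + [("spacer", gap)] if gap > 0 else row


# Hypothetical cell widths (in arbitrary units) packed into rows 30 units wide.
for row in pack_rows([10, 12, 8, 20, 6], row_width=30):
    print(row)
```

Because every row ends at the same width, row-end cells could then tie the horizontal power rails to a common vertical rail, as described above.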

FIGURE 1.4 Routing the CBIC (cell-based IC) shown in Figure 1.2. The use of
regularly shaped standard cells, such as the one in Figure 1.3, from a library
allows ASICs like this to be designed automatically. This ASIC uses two
separate layers of metal interconnect (metal1 and metal2) running at right angles
to each other (like traces on a printed-circuit board). Interconnections between
logic cells use spaces (called channels) between the rows of cells. ASICs may
have three (or more) layers of metal allowing the cell rows to touch with the
interconnect running over the top of the cells.



All the mask layers of a CBIC are customized. This allows megacells (SRAM, a 
SCSI controller, or an MPEG decoder, for example) to be placed on the same IC 
with standard cells. Megacells are usually supplied by an ASIC or library 
company complete with behavioral models and some way to test them (a test 
strategy). ASIC library companies also supply compilers to generate flexible 
DRAM, SRAM, and ROM blocks. Since all mask layers on a standard-cell 
design are customized, memory design is more efficient and denser than for gate 
arrays. 





For logic that operates on multiple signals across a data bus (a datapath, or DP), the
use of standard cells may not be the most efficient ASIC design style. Some
ASIC library companies provide a datapath compiler that automatically generates
datapath logic. A datapath library typically contains cells such as adders,
subtracters, multipliers, and simple arithmetic and logical units (ALUs). The
connectors of datapath library cells are pitch-matched to each other so that they 
fit together. Connecting datapath cells to form a datapath usually, but not always, 
results in faster and denser layout than using standard cells or a gate array. 

Standard-cell and gate-array libraries may contain hundreds of different logic 
cells, including combinational functions (NAND, NOR, AND, OR gates) with 
multiple inputs, as well as latches and flip-flops with different combinations of 
reset, preset and clocking options. The ASIC library company provides designers 
with a data book in paper or electronic form with all of the functional 
descriptions and timing information for each library element. 

1.1.3 Gate-Array-Based ASICs

In a gate array (sometimes abbreviated to GA) or gate-array-based ASIC the
transistors are predefined on the silicon wafer. The predefined pattern of
transistors on a gate array is the base array, and the smallest element that is
replicated to make the base array (like an M. C. Escher drawing, or tiles on a
floor) is the base cell (sometimes called a primitive cell). Only the top few layers
of metal, which define the interconnect between transistors, are defined by the
designer using custom masks. To distinguish this type of gate array from other
types of gate array, it is often called a masked gate array (MGA). The designer
chooses from a gate-array library of predesigned and precharacterized logic cells.
The logic cells in a gate-array library are often called macros. The reason for this
is that the base-cell layout is the same for each logic cell, and only the 
interconnect (inside cells and between cells) is customized, so that there is a 
similarity between gate-array macros and a software macro. Inside IBM, 
gate-array macros are known as books (so that books are part of a library), but 
unfortunately this descriptive term is not very widely used outside IBM. 

We can complete the diffusion steps that form the transistors and then stockpile 
wafers (sometimes we call a gate array a prediffused array for this reason). Since 
only the metal interconnections are unique to an MGA, we can use the stockpiled 
wafers for different customers as needed. Using wafers prefabricated up to the 
metallization steps reduces the time needed to make an MGA, the turnaround
time, to a few days or at most a couple of weeks. The costs for all the initial
fabrication steps for an MGA are shared among customers and this reduces the
cost of an MGA compared to a full-custom or standard-cell ASIC design. 

There are the following different types of MGA or gate-array-based ASICs:

• Channeled gate arrays. 

• Channelless gate arrays.

• Structured gate arrays.



The hyphenation of these terms when they are used as adjectives explains their
construction. For example, in the term channeled gate-array architecture, the
gate array is channeled, as will be explained. There are two common ways of
arranging (or arraying) the transistors on an MGA: in a channeled gate array we
leave space between the rows of transistors for wiring; the routing on a 
channelless gate array uses rows of unused transistors. The channeled gate array 
was the first to be developed, but the channelless gate-array architecture is now 
more widely used. A structured (or embedded) gate array can be either channeled 
or channelless but it includes (or embeds) a custom block. 

1.1.4 Channeled Gate Array 

Figure 1.5 shows a channeled gate array. The important features of this type of
MGA are: 

• Only the interconnect is customized. 

• The interconnect uses predefined spaces between rows of base cells. 

• Manufacturing lead time is between two days and two weeks. 



FIGURE 1.5 A channeled gate-array die. 
The spaces between rows of the base cells 
are set aside for interconnect. 




A channeled gate array is similar to a CBIC: both use rows of cells separated by
channels used for interconnect. One difference is that the space for interconnect
between rows of cells is fixed in height in a channeled gate array, whereas the
space between rows of cells may be adjusted in a CBIC.

1.1.5 Channelless Gate Array 

Figure 1.6 shows a channelless gate array (also known as a channel-free gate
array, sea-of-gates array, or SOG array). The important features of this type of
MGA are as follows:

• Only some (the top few) mask layers are customized: the interconnect.

• Manufacturing lead time is between two days and two weeks. 





FIGURE 1.6 A channelless gate-array or 
sea-of-gates (SOG) array die. The core 
area of the die is completely filled with an 
array of base cells (the base array). 

The key difference between a channelless gate array and a channeled gate array is
that there are no predefined areas set aside for routing between cells on a 
channelless gate array. Instead we route over the top of the gate-array devices. 

We can do this because we customize the contact layer that defines the 
connections between metal1, the first layer of metal, and the transistors. When
we use an area of transistors for routing in a channelless array, we do not make 
any contacts to the devices lying underneath; we simply leave the transistors 
unused. 

The logic density (the amount of logic that can be implemented in a given silicon
area) is higher for channelless gate arrays than for channeled gate arrays. This is
usually attributed to the difference in structure between the two types of array. In 
fact, the difference occurs because the contact mask is customized in a 
channelless gate array, but is not usually customized in a channeled gate array. 
This leads to denser cells in the channelless architectures. Customizing the 
contact layer in a channelless gate array allows us to increase the density of 
gate-array cells because we can route over the top of unused contact sites. 

1.1.6 Structured Gate Array 

An embedded gate array or structured gate array (also known as masterslice or
masterimage) combines some of the features of CBICs and MGAs. One of the
disadvantages of the MGA is the fixed gate-array base cell. This makes the 
implementation of memory, for example, difficult and inefficient. In an 
embedded gate array we set aside some of the IC area and dedicate it to a specific 
function. This embedded area either can contain a different base cell that is more 
suitable for building memory cells, or it can contain a complete circuit block, 
such as a microcontroller. 

Figure 1.7 shows an embedded gate array. The important features of this type of 
MGA are the following: 

• Only the interconnect is customized. 

• Custom blocks (the same for each design) can be embedded. 

• Manufacturing lead time is between two days and two weeks. 





FIGURE 1.7 A structured or
embedded gate-array die showing 
an embedded block in the upper 
left corner (a static random-access 
memory, for example). The rest of 
the die is filled with an array of 
base cells. 




An embedded gate array gives the improved area efficiency and increased 
performance of a CBIC but with the lower cost and faster turnaround of an MGA. 
One disadvantage of an embedded gate array is that the embedded function is 
fixed. For example, if an embedded gate array contains an area set aside for a 32 
k-bit memory, but we only need a 16 k-bit memory, then we may have to waste 
half of the embedded memory function. However, this may still be more efficient 
and cheaper than implementing a 32 k-bit memory using macros on a SOG array. 
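
The waste in the example above is easy to quantify. The short Python sketch below computes the unused fraction of a fixed embedded block; the function name is an assumption made for illustration.

```python
def wasted_fraction(embedded_kbit: float, needed_kbit: float) -> float:
    """Fraction of a fixed embedded memory block that goes unused."""
    if needed_kbit > embedded_kbit:
        raise ValueError("the design needs more memory than the block provides")
    return (embedded_kbit - needed_kbit) / embedded_kbit


print(wasted_fraction(32, 16))  # 0.5: half of the embedded memory is unused
```

Whether that waste is acceptable depends on the area and cost of the alternative, which this sketch does not model.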

ASIC vendors may offer several embedded gate array structures containing 
different memory types and sizes as well as a variety of embedded functions. 
ASIC companies wishing to offer a wide range of embedded functions must 
ensure that enough customers use each different embedded gate array to give the 
cost advantages over a custom gate array or CBIC. (The Sun Microsystems
SPARCstation 1 described in Section 1.3 made use of LSI Logic embedded gate
arrays, and the 10K and 100K series of embedded gate arrays were two of LSI
Logic's most successful products.)

1.1.7 Programmable Logic Devices 

Programmable logic devices ( PLDs ) are standard ICs that are available in 
standard configurations from a catalog of parts and are sold in very high volume 
to many different customers. However, PLDs may be configured or programmed 
to create a part customized to a specific application, and so they also belong to 
the family of ASICs. PLDs use different technologies to allow programming of 
the device. Figure 1.8 shows a PLD and the following important features that all 
PLDs have in common: 

• No customized mask layers or logic cells 

• Fast design turnaround 

• A single large block of programmable interconnect 

• A matrix of logic macrocells that usually consist of programmable array 
logic followed by a flip-flop or latch 




FIGURE 1.8 A programmable 
logic device (PLD) die. The 
macrocells typically consist of 
programmable array logic 
followed by a flip-flop or latch. 
The macrocells are connected 
using a large programmable 
interconnect block. 




The simplest type of programmable IC is a read-only memory ( ROM ). The most 
common types of ROM use a metal fuse that can be blown permanently (a 
programmable ROM or PROM ). An electrically programmable ROM , or 
EPROM , uses programmable MOS transistors whose characteristics are altered 
by applying a high voltage. You can erase an EPROM either by using another 
high voltage (an electrically erasable PROM , or EEPROM ) or by exposing the 
device to ultraviolet light ( UV-erasable PROM , or UVPROM ). 

There is another type of ROM that can be placed on any ASIC: a
mask-programmable ROM (mask-programmed ROM or masked ROM). A
masked ROM is a regular array of transistors permanently programmed using 
custom mask patterns. An embedded masked ROM is thus a large, specialized, 
logic cell. 

The same programmable technologies used to make ROMs can be applied to 
more flexible logic structures. By using the programmable devices in a large 
array of AND gates and an array of OR gates, we create a family of flexible and 
programmable logic devices called logic arrays . The company Monolithic 
Memories (bought by AMD) was the first to produce Programmable Array Logic
(PAL®, a registered trademark of AMD) devices that you can use, for example,
as transition decoders for state machines. A PAL can also include registers 
(flip-flops) to store the current state information so that you can use a PAL to 
make a complete state machine. 

Just as we have a mask-programmable ROM, we could place a logic array as a 
cell on a custom ASIC. This type of logic array is called a programmable logic 
array (PLA). There is a difference between a PAL and a PLA: a PLA has a 
programmable AND logic array, or AND plane , followed by a programmable 
OR logic array, or OR plane ; a PAL has a programmable AND plane and, in 
contrast to a PLA, a fixed OR plane. 
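
The difference between the two structures can be sketched in a few lines of Python. This is only an illustrative model, not a description of any real device: the function names, the dictionary encoding of product terms, and the two-output PAL with two product terms per output (PAL_FIXED_OR) are all assumptions made for the example.

```python
from itertools import product

def and_plane(inputs, terms):
    """Evaluate each product term; a term maps input index -> required value."""
    return [all(inputs[i] == want for i, want in term.items()) for term in terms]

def or_plane(products, connections):
    """Each output ORs together the product terms listed in its row."""
    return [any(products[p] for p in row) for row in connections]

def pla(inputs, terms, connections):
    """PLA: both the AND plane and the OR plane are programmable."""
    return or_plane(and_plane(inputs, terms), connections)

# Hypothetical PAL: output 0 is hard-wired to terms 0 and 1, output 1 to 2 and 3.
PAL_FIXED_OR = [[0, 1], [2, 3]]

def pal(inputs, terms):
    """PAL: only the AND plane is programmable; the OR plane is fixed."""
    return or_plane(and_plane(inputs, terms), PAL_FIXED_OR)

# Program both devices so that F0 = A xor B and F1 = A and B.
A, B = 0, 1
xor_terms = [{A: True, B: False}, {A: False, B: True}]
and_term = {A: True, B: True}

pla_terms = xor_terms + [and_term]            # three product terms suffice
pla_or = [[0, 1], [2]]                        # the OR plane is programmed too
pal_terms = xor_terms + [and_term, and_term]  # spare term duplicated: A.B + A.B = A.B

for a, b in product([False, True], repeat=2):
    print(a, b, pla((a, b), pla_terms, pla_or), pal((a, b), pal_terms))
```

In this sketch the PLA shares product terms freely between outputs, while the PAL must fill every fixed OR input, which is why the spare product term is simply duplicated.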

Depending on how the PLD is programmed, we can have an erasable PLD 
(EPLD), or mask-programmed PLD (sometimes called a masked PLD but usually 
just PLD). The first PALs, PLAs, and PLDs were based on bipolar technology 
and used programmable fuses or links. CMOS PLDs usually employ 
floating-gate transistors (see Section 4.3, EPROM and EEPROM Technology). 





1.1.8 Field-Programmable Gate Arrays 

A step above the PLD in complexity is the field-programmable gate array
( FPGA ). There is very little difference between an FPGA and a PLD; an FPGA is
usually just larger and more complex than a PLD. In fact, some companies that
manufacture programmable ASICs call their products FPGAs and some call them 
complex PLDs . FPGAs are the newest member of the ASIC family and are 
rapidly growing in importance, replacing TTL in microelectronic systems. Even 
though an FPGA is a type of gate array, we do not consider the term gate-array 
based ASICs to include FPGAs. This may change as FPGAs and MGAs start to 
look more alike. 

Figure 1.9 illustrates the essential characteristics of an FPGA: 

• None of the mask layers are customized. 

• A method for programming the basic logic cells and the interconnect. 

• The core is a regular array of programmable basic logic cells that can 
implement combinational as well as sequential logic (flip-flops). 

• A matrix of programmable interconnect surrounds the basic logic cells. 

• Programmable I/O cells surround the core. 

• Design turnaround is a few hours. 

We shall examine these features in detail in Chapters 4 through 8.



FIGURE 1.9 A field-programmable 
gate array (FPGA) die. All FPGAs 
contain a regular structure of 
programmable basic logic cells 
surrounded by programmable 
interconnect. The exact type, size, 
and number of the programmable 
basic logic cells varies 
tremendously.

1.2 Design Flow

Figure 1.10 shows the sequence of steps to design an ASIC; we call this a design 
flow . The steps are listed below (numbered to correspond to the labels in 
Figure 1.10) with a brief description of the function of each step. 







FIGURE 1.10 ASIC design flow. 

1. Design entry. Enter the design into an ASIC design system, either using a 
hardware description language ( HDL ) or schematic entry . 

2. Logic synthesis. Use an HDL (VHDL or Verilog) and a logic synthesis
tool to produce a netlist, a description of the logic cells and their
connections.

3. System partitioning. Divide a large system into ASIC-sized pieces. 

4. Prelayout simulation. Check to see if the design functions correctly. 

5. Floorplanning. Arrange the blocks of the netlist on the chip. 

6. Placement. Decide the locations of cells in a block.

1.3 Case Study

Sun Microsystems released the SPARCstation 1 in April 1989. It is now an old 
design but a very important example because it was one of the first workstations 
to make extensive use of ASICs to achieve the following: 

• Better performance at lower cost 

• Compact size, reduced power, and quiet operation 

• Reduced number of parts, easier assembly, and improved reliability 

The SPARCstation 1 contains about 50 ICs on the system motherboard, excluding
the DRAM used for the system memory (standard parts). The SPARCstation 1
designers partitioned the system into the nine ASICs shown in Table 1.1 and
wrote specifications for each ASIC; this took about three months (see note 1).
LSI Logic and Fujitsu designed the SPARC integer unit (IU) and floating-point
unit ( FPU ) to these specifications. The clock ASIC is a fairly straightforward
design and, of the six remaining ASICs, the video controller/data buffer, the
RAM controller, and the direct memory access ( DMA ) controller are defined by
the 32-bit system bus ( SBus ) and the other ASICs that they connect to. The rest
of the system is partitioned into three more ASICs: the cache controller,
memory-management unit (MMU), and the data buffer. These three ASICs, with
the IU and FPU, have the most critical timing paths and determine the system
partitioning. The design of ASICs 3 through 8 in Table 1.1 took five Sun
engineers six months after the specifications were complete. During the design
process, the Sun engineers simulated the entire SPARCstation 1, including
execution of the Sun operating system (SunOS).

TABLE 1.1 The ASICs in the Sun Microsystems SPARCstation 1.

     SPARCstation 1 ASIC                       Gates (k-gates)
1    SPARC integer unit (IU)                   20
2    SPARC floating-point unit (FPU)           50
3    Cache controller                          9
4    Memory-management unit (MMU)              5
5    Data buffer                               3
6    Direct memory access (DMA) controller     9
7    Video controller/data buffer              4
8    RAM controller                            1
9    Clock generator                           1



Table 1.2 shows the software tools used to design the SPARCstation 1, many of
which are now obsolete. The important point to notice, though, is that there is a
lot more to microelectronic system design than designing the ASICs: less than
one-third of the tools listed in Table 1.2 were ASIC design tools.



TABLE 1.2 The CAD tools used in the design of the Sun Microsystems
SPARCstation 1.

Design level        Function                Tool
ASIC design         ASIC physical design    LSI Logic
                    ASIC logic synthesis    Internal tools and UC Berkeley tools
                    ASIC simulation         LSI Logic
Board design        Schematic capture       Valid Logic
                    PCB layout              Valid Logic Allegro
                    Timing verification     Quad Design Motive and internal tools
Mechanical design   Case and enclosure      Autocad
                    Thermal analysis        Pacific Numerix
                    Structural analysis     Cosmos
Management          Scheduling              Suntrac
                    Documentation           Interleaf and FrameMaker



The SPARCstation 1 cost about $9000 in 1989 or, since it has an execution rate
of approximately 12 million instructions per second (MIPS), $750/MIPS. Using
ASIC technology reduces the motherboard to about the size of a piece of paper
(8.5 inches by 11 inches) with a power consumption of about 12 W. The
SPARCstation 1 pizza box is 16 inches across and 3 inches high, smaller than a
typical IBM-compatible personal computer in 1989. This speed, power, and size
performance is (there are still SPARCstation 1s in use) made possible by using
ASICs. We shall return to the SPARCstation 1, to look more closely at the
partitioning step, in Section 15.3, System Partitioning.



1. Some information in Section 1.3 and Section 15.3 is from the
SPARCstation 10 Architecture Guide, May 1992, p. 2 and pp. 27-28, and from two
publicity brochures (known as sparkle sheets). The first is Concept to System:
How Sun Microsystems Created SPARCstation 1 Using LSI Logic's ASIC
System Technology, A. Bechtolsheim, T. Westberg, M. Insley, and J. Ludemann
of Sun Microsystems; J-H. Huang and D. Boyle of LSI Logic. This is an LSI
Logic publication. The second paper is SPARCstation 1: Beyond the 3M
Horizon, A. Bechtolsheim and E. Frank, a Sun Microsystems publication. I did
not include these as references since they are impossible to obtain now, but I
would like to give credit to Andy Bechtolsheim and the Sun Microsystems and
LSI Logic engineers.



1.4 Economics of ASICs 



In this section we shall discuss the economics of using ASICs in a product and 
compare the most popular types of ASICs: an FPGA, an MGA, and a CBIC. To 
make an economic comparison between these alternatives, we consider the ASIC 
itself as a product and examine the components of product cost: fixed costs and 
variable costs. Making cost comparisons is dangerous: costs change rapidly and
the semiconductor industry is notorious for keeping its costs, prices, and pricing 
strategy closely guarded secrets. The figures in the following sections are 
approximate and used to illustrate the different components of cost. 

1.4.1 Comparison Between ASIC Technologies

The most obvious economic factor in making a choice between the different 
ASIC types is the part cost . Part costs vary enormously: you can pay anywhere
from a few dollars to several hundreds of dollars for an ASIC. In general,
however, FPGAs are more expensive per gate than MGAs, which are, in turn,
more expensive than CBICs. For example, a 0.5 μm, 20 k-gate array might cost
0.01 to 0.02 cents/gate (for more than 10,000 parts) or $2 to $4 per part, but an
equivalent FPGA might be $20. The price per gate for an FPGA to implement the
same function is typically 2 to 5 times the cost of an MGA or CBIC.

Given that an FPGA is more expensive than an MGA, which is more expensive 
than a CBIC, when and why does it make sense to choose a more expensive part? 
Is the increased flexibility of an FPGA worth the extra cost per part? Given that 
an MGA or CBIC is specially tailored for each customer, there are extra hidden 
costs associated with this step that we should consider. To make a true 
comparison between the different ASIC technologies, we shall quantify some of 
these costs. 

1.4.2 Product Cost



The total cost of any product can be separated into fixed costs and variable costs : 



total product cost = fixed product cost + variable product cost × products sold. (1.1)



Fixed costs are independent of sales volume (the number of products sold).
However, the fixed costs amortized per product sold (fixed costs divided by
products sold) decrease as sales volume increases. Variable costs include the cost
of the parts used in the product, assembly costs, and other manufacturing costs.

Let us look more closely at the parts in a product. If we want to buy ASICs to 
assemble our product, the total part cost is 

total part cost = fixed part cost + variable cost per part × volume of parts. (1.2)

Our fixed cost when we use an FPGA is low: we just have to buy the software and
any programming equipment. The fixed part costs for an MGA or CBIC are 
higher and include the costs of the masks, simulation, and test program 
development. We shall discuss these extra costs in more detail in Sections 1.4.3 
and 1.4.4. Figure 1.11 shows a break-even graph that compares the total part cost 
for an FPGA, MGA, and a CBIC with the following assumptions: 

• FPGA fixed cost is $21,800, part cost is $39. 

• MGA fixed cost is $86,000, part cost is $10. 

• CBIC fixed cost is $146,000, part cost is $8. 

At low volumes, the MGA and the CBIC are more expensive because of their 
higher fixed costs. The total part costs of two alternative types of ASIC are equal 
at the break-even volume . In Figure 1.11 the break-even volume for the FPGA 
and the MGA is about 2000 parts. The break-even volume between the FPGA 
and the CBIC is about 4000 parts. The break-even volume between the MGA and 
the CBIC is higher, at about 20,000 parts.
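
As a rough check, a break-even volume can be computed by setting Equation 1.2 equal for two technologies and solving for the volume. The sketch below does this with the fixed costs and part costs assumed above; the function names and the tuple layout are choices made for this example only.

```python
def total_part_cost(fixed_cost, cost_per_part, volume):
    """Equation 1.2: fixed part cost plus variable cost per part times volume."""
    return fixed_cost + cost_per_part * volume

def break_even_volume(low_fixed, high_fixed):
    """Volume at which two (fixed cost, cost per part) options cost the same.

    low_fixed is the option with the lower fixed cost and higher part cost.
    """
    (fixed_a, part_a), (fixed_b, part_b) = low_fixed, high_fixed
    return (fixed_b - fixed_a) / (part_a - part_b)

# (fixed cost in dollars, cost per part in dollars), from the assumptions above.
FPGA = (21_800, 39)
MGA = (86_000, 10)
CBIC = (146_000, 8)

for name, pair in [("FPGA/MGA", (FPGA, MGA)),
                   ("FPGA/CBIC", (FPGA, CBIC)),
                   ("MGA/CBIC", (MGA, CBIC))]:
    print(name, round(break_even_volume(*pair)))
```

With these straight-line assumptions the FPGA/MGA and FPGA/CBIC crossovers come out near 2200 and 4000 parts. The MGA/CBIC crossover computes to 30,000 parts, somewhat above the approximate value read from the graph, which is a reminder that the figures are only illustrative.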

FIGURE 1.11 A break-even analysis for an FPGA, a masked gate array (MGA)
and a custom cell-based ASIC (CBIC). The vertical axis is the cost of parts. The
break-even volume between two technologies is the point at which the total costs
of parts are equal. These numbers are very approximate.



We shall describe how to calculate the fixed part costs next. Following that we
shall discuss how we came up with cost per part of $39, $10, and $8 for the
FPGA, MGA, and CBIC.



1.4.3 ASIC Fixed Costs 

Figure 1.12 shows a spreadsheet, Fixed Costs, that calculates the fixed part costs 
associated with ASIC design.

FIGURE 1.12 A spreadsheet, Fixed Costs, for a field-programmable gate array
(FPGA), a masked gate array (MGA), and a cell-based ASIC (CBIC). These
costs can vary wildly.

The training cost includes the cost of the time to learn any new electronic design 
automation ( EDA ) system. For example, a new FPGA design system might 
require a few days to learn; a new gate-array or cell-based design system might 
require taking a course. Figure 1.12 assumes that the cost of an engineer 
(including overhead, benefits, infrastructure, and so on) is between $100,000 and 
$200,000 per year or $2000 to $4000 per week (in the United States in 1990s 
dollars). 

Next we consider the hardware and software cost for ASIC design. Figure 1.12 
shows some typical figures, but you can spend anywhere from $1000 to 
$1 million (and more) on ASIC design software and the necessary infrastructure. 

We try to measure productivity of an ASIC designer in gates (or transistors) per
day. This is like trying to predict how long it takes to dig a hole, and the number
of gates per day an engineer averages varies wildly. ASIC design productivity
must increase as ASIC sizes increase and will depend on experience, design
tools, and the ASIC complexity. If we are using similar design methods, design
productivity ought to be independent of the type of ASIC, but FPGA design
software is usually available as a complete bundle on a PC. This means that it is
often easier to learn and use than semicustom ASIC design tools.

Every ASIC has to pass a production test to make sure that it works. With 
modern test tools the generation of any test circuits on each ASIC that are needed 
for production testing can be automatic, but it still involves a cost for design for 
test . An FPGA is tested by the manufacturer before it is sold to you and before 
you program it. You are still paying for testing an FPGA, but it is a hidden cost 
folded into the part cost of the FPGA. You do have to pay for any programming 
costs for an FPGA, but we can include these in the hardware and software cost. 

The nonrecurring-engineering ( NRE ) charge includes the cost of work done by 
the ASIC vendor and the cost of the masks. The production test uses sets of test 
inputs called test vectors , often many thousands of them. Most ASIC vendors 
require simulation to generate test vectors and test programs for production 
testing, and will charge for a test-program development cost . The number of 
masks required by an ASIC during fabrication can range from three or four (for a 
gate array) to 15 or more (for a CBIC). Total mask costs can range from $5000 to 
$50,000 or more. The total NRE charge can range from $10,000 to $300,000 or 
more and will vary with volume and the size of the ASIC. If you commit to high 
volumes (above 100,000 parts), the vendor may waive the NRE charge. The NRE 
charge may also include the costs of software tools, design verification, and 
prototype samples. 

If your design does not work the first time, you have to complete a further design
pass ( turn or spin ) that requires additional NRE charges. Normally you sign a
contract (sign off a design) with an ASIC vendor that guarantees first-pass
success; this means that if you designed your ASIC according to rules specified
by the vendor, then the vendor guarantees that the silicon will perform according
to the simulation or you get your money back. This is why the difference between
semicustom and full-custom design styles is so important: the ASIC vendor will
not (and cannot) guarantee your design will work if you use any full-custom
design techniques.

Nowadays it is almost routine to have an ASIC work on the first pass. However, 
if your design does fail, it is little consolation to have a second pass for free if 
your company goes bankrupt in the meantime. Figure 1.13 shows a profit model 
that represents the profit flow during the product lifetime . Using this model, we 
can estimate the lost profit due to any delay. 





FIGURE 1.13 A profit model. If a product is introduced on time, the total sales 
are $60 million (the area of the higher triangle). With a three-month (one fiscal 
quarter) delay the sales decline to $25 million. The difference is shown as the 
shaded area between the two triangles and amounts to a lost revenue of 
$35 million. 

Suppose we have the following situation: 

• The product lifetime is 18 months (6 fiscal quarters). 

• The product sales increase (linearly) at $10 million per quarter 
independently of when the product is introduced (we suppose this is 
because we can increase production and sales only at a fixed rate). 

• The product reaches its peak sales at a point in time that is independent of 
when we introduce a product (because of external market factors that we 
cannot control). 

• The product declines in sales (linearly) to the end of its life, a point in time
that is also independent of when we introduce the product (again due to
external market forces).

The simple profit and revenue model of Figure 1.13 shows us that we would lose 
$35 million in sales in this situation due to a 3-month delay. Despite the obvious 
problems with such a simple model (how can we introduce the same product 
twice to compare the performance?), it is widely used in marketing. In the 
electronics industry product lifetimes continue to shrink. In the PC industry it is 
not unusual to have a product lifetime of 18 months or less. This means that it is 
critical to achieve a rapid design time (or high product velocity ) with no delays. 
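
The triangle areas in Figure 1.13 can be reproduced with a short calculation. The sketch below assumes that peak sales occur at the end of the second quarter; that timing is not stated in the text but is inferred here from the $60 million on-time total, and the function and parameter names are made up for the example.

```python
def total_sales(delay_quarters, ramp=10.0, peak_quarter=2.0, end_quarter=6.0):
    """Total sales in $ million: the area of the sales triangle.

    ramp          growth of the sales rate, in $ million per quarter, per quarter
    peak_quarter  fixed time of peak sales (assumed, inferred from the $60M total)
    end_quarter   fixed end of the product life (18 months = 6 quarters)
    """
    if delay_quarters >= peak_quarter:
        return 0.0  # this simple model has no ramp left after the market peak
    peak_rate = ramp * (peak_quarter - delay_quarters)  # height of the triangle
    base = end_quarter - delay_quarters                 # width of the triangle
    return 0.5 * base * peak_rate

on_time = total_sales(0)   # product introduced on schedule
late = total_sales(1)      # introduced one quarter (three months) late
print(on_time, late, on_time - late)
```

Under these assumptions the on-time and delayed totals are $60 million and $25 million, so a one-quarter slip costs $35 million: the delay shortens the base of the triangle and also halves its height.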

The last fixed cost shown in Figure 1.12 corresponds to an insurance policy. 
When a company buys an ASIC part, it needs to be assured that it will always 
have a back-up source, or second source , in case something happens to its first or 
primary source. Established FPGA companies have a second source that 
produces equivalent parts. With a custom ASIC you may have to do some 
redesign to transfer your ASIC to the second source. However, for all ASIC 
types, switching production to a second source will involve some cost. 

Figure 1.12 assumes a second-source cost of $2000 for all types of ASIC (the 
amount may be substantially more than this). 




1.4.4 ASIC Variable Costs 



Figure 1.14 shows a spreadsheet, Variable Costs, that calculates some example 
part costs. This spreadsheet uses the terms and parameters defined below the 
figure.

                  FPGA      MGA       CBIC      Units
Design            10,000    10,000    10,000    gates
Density           10,000    20,000    25,000    gates/sq.cm
Die size
Die/wafer
Defect density              0.90
Yield             65        72                  %
Die cost          25        7
Profit margin     60        45        50        %
Price/gate        0.39      0.10      0.08      cents
Part cost         $39       $10       $8

FIGURE 1.14 A spreadsheet, Variable Costs, to calculate the part cost (that is
the variable cost for a product using ASICs) for different ASIC technologies.

• The wafer size increases every few years. From 1985 to 1990, 4-inch to 
6-inch diameter wafers were common; equipment using 6-inch to 8-inch 
wafers was introduced between 1990 and 1995; the next step is the 300 mm
or 12-inch wafer. The 12-inch wafer will probably take us to 2005. 

• The wafer cost depends on the equipment costs, process costs, and 
overhead in the fabrication line. A typical wafer cost is between $1000 and 
$5000, with $2000 being average; the cost declines slightly during the life 
of a process and increases only slightly from one process generation to the 
next. 

• Moore's Law (after Gordon Moore of Intel) models the observation that
the number of transistors on a chip roughly doubles every 18 months. Not
all designs follow this law, but a large ASIC design seems to grow by a
factor of 10 every 5 years (close to Moore's Law). In 1990 a large ASIC
design size was 10 k-gate, in 1995 a large design was about 100 k-gate, in
2000 it will be 1 M-gate, in 2005 it will be 10 M-gate.

• The gate density is the number of gate equivalents per unit area 
(remember: a gate equivalent, or gate, corresponds to a two-input NAND 
gate). 

• The gate utilization is the percentage of gates that are on a die that we can 
use (on a gate array we waste some gate space for interconnect). 

• The die size is determined by the design size (in gates), the gate density,
and the utilization of the die.

• The number of die per wafer depends on the die size and the wafer size 
(we have to pack rectangular or square die, together with some test chips, 
on to a circular wafer so some space is wasted). 

• The defect density is a measure of the quality of the fabrication process. 
The smaller the defect density the less likely there is to be a flaw on any 
one die. A single defect on a die is almost always fatal for that die. Defect 
density usually increases with the number of steps in a process. A defect
density of less than 1 cm^-2 (less than one defect per square centimeter) is
typical and required for a submicron CMOS process.

• The yield of a process is the key to a profitable ASIC company. The yield 
is the fraction of die on a wafer that are good (expressed as a percentage). 
Yield depends on the complexity and maturity of a process. A process may 
start out with a yield of close to zero for complex chips, which then climbs 
to above 50 percent within the first few months of production. Within a 
year the yield has to be brought to around 80 percent for the average 
complexity ASIC for the process to be profitable. Yields of 90 percent or 
more are not uncommon. 

• The die cost is determined by wafer cost, number of die per wafer, and the 
yield. Of these parameters, the most variable and the most critical to 
control is the yield. 

• The profit margin (what you sell a product for, less what it costs you to
make it, divided by the cost) is determined by the ASIC company's fixed
and variable costs. ASIC vendors that make and sell custom ASICs have
huge fixed and variable costs associated with building and running
fabrication facilities (a fabrication plant is a fab ). FPGA companies are
typically fabless (they do not own a fab), so they must pass on the costs of
the chip manufacture (plus the profit margin of the chip manufacturer) and the
development cost of the FPGA structure in the FPGA part cost. The
profitability of any company in the ASIC business varies greatly.

• The price per gate (usually measured in cents per gate) is determined by 
die costs and design size. It varies with design size and declines over time. 

• The part cost is determined by all of the preceding factors. As such it will 
vary widely with time, process, yield, economic climate, ASIC size and 
complexity, and many other factors. 
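
The relationships between these parameters can be written out as a few one-line formulas. The sketch below is only an illustration of how the terms combine: the function names are invented here, and the wafer cost, die per wafer, yield, and utilization used in the last two lines are hypothetical values chosen for round numbers, not entries from Figure 1.14.

```python
def die_area_sq_cm(design_gates, gates_per_sq_cm, utilization):
    """Die area needed for a design; utilization is a fraction between 0 and 1."""
    return design_gates / (gates_per_sq_cm * utilization)

def die_cost(wafer_cost, die_per_wafer, yield_fraction):
    """Cost of one good die: the wafer cost is spread over the good die only."""
    return wafer_cost / (die_per_wafer * yield_fraction)

def part_price(die_cost_dollars, profit_margin):
    """Selling price, with profit margin defined as (price - cost) / cost."""
    return die_cost_dollars * (1.0 + profit_margin)

def price_per_gate_cents(part_price_dollars, design_gates):
    """Price per gate in cents for a part implementing design_gates gates."""
    return 100.0 * part_price_dollars / design_gates

# Values from the text: a $25 die cost and a 60 % margin for the FPGA,
# and a $39 part that implements a 10,000-gate design.
fpga_price = part_price(25, 0.60)
fpga_cents_per_gate = price_per_gate_cents(39, 10_000)

# Hypothetical inputs, for illustration only.
example_area = die_area_sq_cm(10_000, 10_000, 0.5)
example_die_cost = die_cost(2000, 100, 0.8)

print(fpga_price, fpga_cents_per_gate, example_area, example_die_cost)
```

The $25 die cost and 60 percent margin give a price of about $40, close to the $39 FPGA part cost used in the break-even analysis, and $39 spread over 10,000 gates is the 0.39 cents per gate quoted for the FPGA.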

As an estimate you can assume that the price per gate for any process technology
falls at about 20 % per year during its life (the average life of a CMOS process is
2 to 4 years, and can vary widely). Beyond the life of a process, prices can increase
as demand falls and the fabrication equipment becomes harder to maintain.

Figure 1.15 shows the price per gate for the different ASICs and process 
technologies using the following assumptions: 

• For any new process technology the price per gate decreases by 40 % in 
the first year, 30 % in the second year, and then remains constant. 




• A new process technology is introduced approximately every 2 years, with
feature size decreasing by a factor of two every 5 years as follows: 2 μm
in 1985, 1.5 μm in 1987, 1 μm in 1989, 0.8 to 0.6 μm in 1991 to 1993, 0.5 to
0.35 μm in 1996 to 1997, 0.25 to 0.18 μm in 1998 to 2000.

• CBICs and MGAs are introduced at approximately the same time and 
price. 

• The price of a new process technology is initially 10 % above the process 
that it replaces. 

• FPGAs are introduced one year after CBICs that use the same process 
technology. 

• The initial FPGA price (per gate) is 10 percent higher than the initial price 
for CBICs or MGAs using the same process technology. 
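
The long-run trend implied by these assumptions can be estimated with a short calculation. The sketch below reads "10 % above the process that it replaces" as 10 % above the price the older process has reached when the new one appears two years later; that reading is an assumption, as are the variable names.

```python
import math

FIRST_YEAR_DROP = 0.40    # price per gate falls 40 % in the first year
SECOND_YEAR_DROP = 0.30   # and a further 30 % in the second year
INTRO_PREMIUM = 0.10      # new process starts 10 % above the one it replaces
YEARS_PER_GENERATION = 2  # a new process arrives about every 2 years

# Ratio of one generation's introductory price to the previous generation's.
generation_factor = ((1 + INTRO_PREMIUM)
                     * (1 - FIRST_YEAR_DROP)
                     * (1 - SECOND_YEAR_DROP))

# Equivalent steady annual decline over the two-year cycle.
annual_decline = 1 - math.pow(generation_factor, 1 / YEARS_PER_GENERATION)

print(f"price ratio per generation: {generation_factor:.3f}")
print(f"equivalent annual decline:  {annual_decline:.1%}")
```

Under this reading each generation starts at a little under half the introductory price of the one before it, which corresponds to a decline of roughly 30 percent per year when averaged over successive generations.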

From Figure 1.15 you can see that the successive introduction of new process 
technologies every 2 years drives the price per gate down at a rate close to 30 
percent per year. The cost figures that we have used in this section are very 
approximate and can vary widely (this means they may be off by a factor of 2 but 
probably are correct within a factor of 10). ASIC companies do use spreadsheet 
models like these to calculate their costs.

FIGURE 1.15 Example price per gate figures.

Having decided if, and then which, ASIC technology is appropriate, you need to 
choose the appropriate cell library. Next we shall discuss the issues surrounding 
ASIC cell libraries: the different types, their sources, and their contents. 




1.5 ASIC Cell Libraries 



The cell library is the key part of ASIC design. For a programmable ASIC the
FPGA company supplies you with a library of logic cells in the form of a design
kit ; you normally do not have a choice, and the cost is usually a few thousand
dollars. For MGAs and CBICs you have three choices: the ASIC vendor (the
company that will build your ASIC) will supply a cell library, or you can buy a 
cell library from a third-party library vendor , or you can build your own cell 
library. 

The first choice, using an ASIC-vendor library , requires you to use a set of 
design tools approved by the ASIC vendor to enter and simulate your design. 

You have to buy the tools, and the cost of the cell library is folded into the NRE. 
Some ASIC vendors (especially for MGAs) supply tools that they have 
developed in-house. For some reason the more common model in Japan is to use 
tools supplied by the ASIC vendor, but in the United States, Europe, and 
elsewhere designers want to choose their own tools. Perhaps this has to do with 
the relationship between customer and supplier being a lot closer in Japan than it 
is elsewhere. 

An ASIC vendor library is normally a phantom library : the cells are empty boxes,
or phantoms , but contain enough information for layout (for example, you would
only see the bounding box or abutment box in a phantom version of the cell in 
Figure 1.3). After you complete layout you hand off a netlist to the ASIC vendor, 
who fills in the empty boxes ( phantom instantiation ) before manufacturing your 
chip. 

The second and third choices require you to make a buy-or-build decision . If you 
complete an ASIC design using a cell library that you bought, you also own the 
masks (the tooling ) that are used to manufacture your ASIC. This is called 
customer-owned tooling ( COT , pronounced see-oh-tee). A library vendor 
normally develops a cell library using information about a process supplied by an 
ASIC foundry . An ASIC foundry (in contrast to an ASIC vendor) only provides 
manufacturing, with no design help. If the cell library meets the foundry 
specifications, we call this a qualified cell library . These cell libraries are 
normally expensive (possibly several hundred thousand dollars), but if a library is 
qualified at several foundries this allows you to shop around for the most 
attractive terms. This means that buying an expensive library can be cheaper in
the long run than the other solutions for high-volume production.



The third choice is to develop a cell library in-house. Many large computer and
electronics companies make this choice. Most of the cell libraries designed today
are still developed in-house despite the fact that the process of library
development is complex and very expensive.

However created, each cell in an ASIC cell library must contain the following: 

• A physical layout 

• A behavioral model 

• A Verilog/VHDL model 

• A detailed timing model 

• A test strategy 

• A circuit schematic 

• A cell icon 

• A wire-load model 

• A routing model 

For MGA and CBIC cell libraries we need to complete cell design and cell layout 
and shall discuss this in Chapter 2. The ASIC designer may not actually see the 
layout if it is hidden inside a phantom, but the layout will be needed eventually. 

In a programmable ASIC the cell layout is part of the programmable ASIC 
design (see Chapter 4). 

The ASIC designer needs a high-level, behavioral model for each cell because 
simulation at the detailed timing level takes too long for a complete ASIC design. 
For a NAND gate a behavioral model is simple. A multiport RAM model can be 
very complex. We shall discuss behavioral models when we describe Verilog and 
VHDL in Chapter 10 and Chapter 11. The designer may require Verilog and 
VHDL models in addition to the models for a particular logic simulator. 

ASIC designers also need a detailed timing model for each cell to determine the 
performance of the critical pieces of an ASIC. It is too difficult, too 
time-consuming, and too expensive to build every cell in silicon and measure the 
cell delays. Instead library engineers simulate the delay of each cell, a process 
known as characterization . Characterizing a standard-cell or gate-array library 
involves circuit extraction from the full-custom cell layout for each cell. The 
extracted schematic includes all the parasitic resistance and capacitance elements. 
Then library engineers perform a simulation of each cell including the parasitic 
elements to determine the switching delays. The simulation models for the 
transistors are derived from measurements on special chips included on a wafer 
called process control monitors ( PCMs ) or drop-ins . Library engineers then use 
the results of the circuit simulation to generate detailed timing models for logic 
simulation. We shall cover timing models in Chapter 13. 

All ASICs need to be production tested (programmable ASICs may be tested by
the manufacturer before they are customized, but they still need to be tested).
Simple cells in small or medium-size blocks can be tested using automated
techniques, but large blocks such as RAM or multipliers need a planned strategy.
We shall discuss test in Chapter 14.



The cell schematic (a netlist description) describes each cell so that the cell
designer can perform simulation for complex cells. You may not need the
detailed cell schematic for all cells, but you need enough information to compare
what you think is on the silicon (the schematic) with what is actually on the
silicon (the layout); this is a layout versus schematic (LVS) check.

If the ASIC designer uses schematic entry, each cell needs a cell icon together 
with connector and naming information that can be used by design tools from 
different vendors. We shall cover ASIC design using schematic entry in 
Chapter 9. One of the advantages of using logic synthesis (Chapter 12) rather 
than schematic design entry is eliminating the problems with icons, connectors, 
and cell names. Logic synthesis also makes moving an ASIC between different 
cell libraries, or retargeting , much easier. 

In order to estimate the parasitic capacitance of wires before we actually
complete any routing, we need a statistical estimate of the capacitance for a net in
a given size circuit block. This usually takes the form of a look-up table known as
a wire-load model. We also need a routing model for each cell. Large cells are
too complex for the physical design or layout tools to handle directly and we
need a simpler representation, a phantom, of the physical layout that still contains
all the necessary information. The phantom may include information that tells the
automated routing tool where it can and cannot place wires over the cell, as well
as the location and types of the connections to the cell.
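
A wire-load model of the kind just described can be sketched as a small look-up
table. The sketch below assumes the table is indexed by the fanout of the net
(a common choice, but an assumption here) and interpolates between entries. The
name WIRE_LOAD_PF and every capacitance value in it are hypothetical
placeholders for illustration, not data from any real cell library.

```python
from bisect import bisect_left

# Hypothetical wire-load table for one block size:
# fanout of the net -> estimated net capacitance in pF.
# These numbers are placeholders, not measurements from a real library.
WIRE_LOAD_PF = {1: 0.02, 2: 0.04, 3: 0.06, 4: 0.09, 8: 0.20}


def estimate_net_capacitance(fanout: int) -> float:
    """Estimate the capacitance (pF) of a net before routing.

    Exact table entries are returned directly, values between entries are
    linearly interpolated, and values outside the table are clamped to the
    nearest entry.
    """
    fanouts = sorted(WIRE_LOAD_PF)
    if fanout <= fanouts[0]:
        return WIRE_LOAD_PF[fanouts[0]]
    if fanout >= fanouts[-1]:
        return WIRE_LOAD_PF[fanouts[-1]]
    i = bisect_left(fanouts, fanout)
    if fanouts[i] == fanout:
        return WIRE_LOAD_PF[fanout]
    lo, hi = fanouts[i - 1], fanouts[i]
    frac = (fanout - lo) / (hi - lo)
    return WIRE_LOAD_PF[lo] + frac * (WIRE_LOAD_PF[hi] - WIRE_LOAD_PF[lo])


if __name__ == "__main__":
    for n in (1, 2, 6, 12):
        print(f"fanout {n:2d}: {estimate_net_capacitance(n):.3f} pF")
```

A real library would normally supply a separate table for each range of block
sizes; a single table keeps the sketch short.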




1.6 Summary 

In this chapter we have looked at the difference between full-custom ASICs, 
semi-custom ASICs, and programmable ASICs. Table 1.3 summarizes their 
different features. ASICs use a library of predesigned and precharacterized logic 
cells. In fact, we could define an ASIC as a design style that uses a cell library 
rather than in terms of what an ASIC is or what an ASIC does. 



TABLE 1.3 Types of ASIC.

ASIC type      Family member                          Custom mask layers   Custom logic cells
Full-custom    Analog/digital                         All                  Some
Semicustom     Cell-based (CBIC)                      All                  None
               Masked gate array (MGA)                Some                 None
Programmable   Field-programmable gate array (FPGA)   None                 None
               Programmable logic device (PLD)        None                 None



You can think of ICs like pizza. A full-custom pizza is built from scratch. You
can customize all the layers of a CBIC pizza, but from a predefined selection, and
it takes a while to cook. An MGA pizza uses precooked crusts with fixed sizes
and you choose only from a few different standard types on a menu. This makes
MGA pizza a little faster to cook and a little cheaper. An FPGA is rather like a
frozen pizza: you buy it at the supermarket in a limited selection of sizes and
types, but you can put it in the microwave at home and it will be ready in a few
minutes.

In each chapter we shall indicate the key concepts. In this chapter they are 

• The difference between full-custom and semicustom ASICs 

• The difference between standard-cell, gate-array, and programmable 
ASICs 

• The ASIC design flow 

• Design economics including part cost, NRE, and breakeven volume 

• The contents and use of an ASIC cell library 



Next, in Chapter 2, we shall take a closer look at the semicustom ASICs that 
were introduced in this chapter. 




1.7 Problems 



1.1 (Break-even volumes, 60 min.) You need a spreadsheet program (such as 
Microsoft Excel) for this problem. 

• a. Build a spreadsheet, Break-even Analysis, to generate Figure 1.11. 

• b. Derive equations for the break-even volumes (there are three: 
FPGA/MGA, FPGA/CBIC, and MGA/CBIC) and calculate their values. 

• c. Increase the FPGA part cost by $10 and use your spreadsheet to produce 
the new break-even graph. Hint: (For users of Excel-like spreadsheets) use 
the XY scatter plot option. Use the first column for the x -axis data. 

• d. Find the new break-even volumes (change the volume until the cost 
becomes the same for two technologies). 

• e. Program your spreadsheet to automatically find the break-even volumes.
Now graph the break-even volume (for a choice between FPGA and CBIC)
for values of FPGA part costs ranging from $10 to $50 and CBIC costs
ranging from $2 to $10 (do not change the fixed costs from Figure 1.12).

• f. Calculate the sensitivity of the break-even volumes to changes in the part 
costs and fixed costs. There are three break-even volumes and each of 
these is sensitive to two part costs and two fixed costs. Express your 
answers in two ways: in equation form and as numbers (for the values in 
Section 1.4.2 and Figure 1.11). 

• g. The costs in Figure 1.11 are not unrealistic. What can you say from your 
answers if you are a defense contractor, primarily selling products in 
volumes of less than 1000 parts? What if you are a PC board vendor 
selling between 10,000 and 100,000 parts? 

1.2 (Design productivity, 10 min.) Given the figures for the SPARCstation 1 
ASICs described in Section 1.3 what was the productivity measured in 
transistors/day? and measured in gates/day? Compare your answers with the 
figures for productivity in Section 1.4.3 and explain any differences. How 
accurate do you think productivity estimates are? 

1.3 (ASIC package size, 30 min.) Assuming, for this problem, a gate density of
1.0 gate/mil^2 (see Section 15.4, Estimating ASIC Size, for a detailed
explanation of this figure), the maximum number of gates you can put in a
package is determined by the maximum die size for each of the packages shown
in Table 1.4. The maximum die size is determined by the package cavity size;
these are package-limited ASICs. Calculate the maximum number of I/O pads
that can be placed on a die for each package if the pad spacing is: (i) 5 mil, and
(ii) 10 mil. Compare your answers with the maximum numbers of pins (or leads)
on each package and comment. Now calculate the minimum number of gates that
you can put in each package determined by the minimum die size.

TABLE 1.4 Die size limits for ASIC packages.

Package (1)   Number of pins or leads   Maximum die size (2) (mil^2)   Minimum die size (3) (mil^2)
PLCC          44                        320 × 320                      94 × 94
PLCC          68                        420 × 420                      154 × 154
PLCC          84                        395 × 395                      171 × 171
PQFP          100                       338 × 338                      124 × 124
PQFP          144                       350 × 350                      266 × 266
PQFP          160                       429 × 429                      248 × 248
PQFP          208                       501 × 501                      427 × 427
CPGA          68                        480 × 480                      200 × 200
CPGA          84                        370 × 370                      200 × 200
CPGA          120                       480 × 480                      175 × 175
CPGA          144                       470 × 470                      250 × 250
CPGA          223                       590 × 590                      290 × 290
CPGA          299                       590 × 590                      470 × 470
PPGA          64                        230 × 230                      120 × 120
PPGA          84                        380 × 380                      150 × 150
PPGA          100                       395 × 395                      150 × 150
PPGA          120                       395 × 395                      190 × 190
PPGA          144                       660 × 655                      230 × 230
PPGA          180                       540 × 540                      330 × 330
PPGA          208                       500 × 500                      395 × 395



1.4 (ASIC vendor costs, 30 min.) There is a well-known saying in the ASIC
business: "We lose money on every part, but we make it up in volume." This has a
serious side. Suppose Sumo Silicon currently has two customers: Mr. Big, who
currently buys 10,000 parts per week, and Ms. Smart, who currently buys 4800
parts per week. A new customer, Ms. Teeny (who is growing fast), wants to buy
1200 parts per week. Sumo's costs are

wafer cost = $500 + ($250,000 / W),

where W is the number of wafer starts per week. Assume each wafer carries 200
chips (parts), all parts are identical, and the yield is

yield = 70 + 0.2 × (W - 80) %. (1.3)

Currently Sumo has a profit margin of 35 percent. Sumo is currently running at
100 wafer starts per week for Mr. Big and Ms. Smart. Sumo thinks they can get
50 cents more out of Mr. Big for his chips, but Ms. Smart won't pay any more.

We can calculate how much Sumo can afford to lose per chip if they want
Ms. Teeny's business really badly.

• a. What is Sumo's current yield?

• b. How many good parts is Sumo currently producing per week? (Hint: Is
this enough to supply Mr. Big and Ms. Smart?)

• c. Calculate how many extra wafer starts per week we need to supply
Ms. Teeny (the yield will change; what is the new yield?). Think when you
give this answer.

• d. What is Sumo's increase in costs to supply Ms. Teeny?

• e. Multiply your answer to part d by 1.35 (to account for Sumo's profit).
This is the increase in revenue we need to cover our increased costs to
supply Ms. Teeny.

• f. Now suppose we charge Mr. Big 50 cents more per part. How much
extra revenue does that generate?

• g. How much does Ms. Teeny's extra business reduce the wafer cost?

• h. How much can Sumo Silicon afford to lose on each of Ms. Teeny's
parts, cover its costs, and still make a 35 percent profit?

1.5 (Silicon, 20 min.) How much does a 6-inch silicon wafer weigh? a 12-inch 
wafer? How much does a carrier (called a boat) that holds twenty 12-inch wafers 
weigh? What implications does this have for manufacturing? 

• a. How many die that are 1-inch on a side does a 12-inch wafer hold? If 
each die is worth $100, how much is a 20-wafer boat worth? If a factory is 
processing 10 of these boats in different furnaces when the power is 
interrupted and those wafers have to be scrapped, how much money is 
lost? 

• b. The size of silicon factories (fabs or foundries) is measured in wafer 
starts per week. If a factory is capable of 5000 12-inch wafer starts per 
week, with an average die of 500 mil on a side that sells for $20 and 90 
percent yield, what is the value in dollars/year of the factory production? 
What fraction of the current gross national (or domestic) product 
(GNP/GDP) of your country is that? If the yield suddenly drops from 90 
percent to 40 percent (a yield bust) how much revenue is the company 
losing per day? If the company has a cash reserve of $100 million and this 
revenue loss drops straight to the bottom line, how long does it take for 
the company to go out of business? 

• c. TSMC produced 2 million 6-inch wafers in 1996; how many 500 mil die
is that? TSMC's $500 million Camas fab in Washington is scheduled to
produce 30,000 8-inch wafers per month by the year 2000 using a 0.35 µm
process. If a 1 Mb SRAM yields 1500 good die per 8-inch wafer and there
are 1700 gross die per wafer, what is the yield? What is the die size? If the
SRAM cell size is 7 µm^2, what fraction of the die is used by the cells?
What is TSMC's cost per bit for SRAM if the wafer cost is $2000? If a
16 Mb DRAM on the same fab line uses a 16 mm^2 die, what is the cost per
bit for DRAM assuming the same yield?

1.6 (Simulation time, 30 min.) ". . . The system-level simulation used
approximately 4000 lines of SPARC assembly language . . . each simulation
clock was simulated in three real time seconds" (Sun Technology article).

• a. With a 20 MHz clock how much slower is simulated time than real
time?

• b. How long would it take to simulate all 4000 lines of test code? (Assume
one line of assembly code per cycle, a good approximation compared to the
others we are making.)

The article continues: "the entire system was simulated, running actual code,
including several milliseconds of SunOS execution. Four days after power-up,
SPARCstation 1 booted SunOS and announced: 'hello world'."

• c. How long would it take to simulate 5 ms of code? 

• d. Find out how long it takes to boot a UNIX workstation in real time. How 
many clock cycles is this? 

• e. The machine is not executing boot code all this time; you have to wait
for disk drives to spin up, file system checks to complete, and so on.
Make some estimates as to how much code is required to boot an operating
system (OS) and how many clock cycles this would take to execute.

The number of clock cycles you need to simulate to boot a system is somewhere 
between your answers to parts d and e. 

• f. From your answers make an estimate of how long it takes to simulate 
booting the OS. Does this seem reasonable? 

• g. Could the engineers have simulated a complete boot sequence? 

• h. Do you think the engineers expected the system to boot on first silicon, 
given the complexity of the system and how long they would have to wait 
to simulate a complete boot sequence? Explain. 

1.7 (Price per gate, 5 min.) Given the assumptions of Section 1.4.4 on the price
per gate of different ASIC technologies, what has to change for the price per gate
for an FPGA to be less than that for an MGA or CBIC, if all three use the same
process?

1.8 (Pentiums, 20 min.) Read the online tour of the Pentium Pro at 
http://www.intel.com (adapted from a paper presented at the 1995 International 
Solid-State Circuits Conference). This is not an ASIC design; notice the section 
on full-custom circuit design. Notice also the comments on the use of 'assert' 
statements in the HDL code that described the circuits. Find out the approximate 
cost of the Intel Pentium (3.3 million transistors) and Pentium Pro (5.5 million 
transistors) microprocessors. 

• a. Assuming there are four transistors per gate equivalent, what is the price
per gate?

• b. Find out the cost of a 1 Mb, 4 Mb, 8 Mb, or 16 Mb DRAM. Assuming 
one transistor per memory bit, what is the price per gate of DRAM? 

• c. Considering that both have roughly the same die size, are just as 
complex to design and to manufacture, why is there such a huge difference 
in price per gate between microprocessors and DRAM? 

1.9 (Inverse embedded arrays, 10 min.) A relatively new cousin of the embedded 
gate array, the inverse-embedded gate array , is a cell-based ASIC that contains 
an embedded gate-array megacell. List the features as well as the advantages and 
disadvantages of this type of ASIC in the same way as for the other members of 
the ASIC family in Section 1.1. 

1.10 (0.5-gate design, 60 min.) It is a good idea to complete a 0.5-gate ASIC 
design (an inverter connected between an input pad and an output pad) in the first 
week (day) of class. Capture the commands in a report that shows all the steps 
taken to create your chip starting from an empty directory halfgate . 

1.11 (Filenames, 30 min.) Start a list of filename extensions used in ASIC design. 
Table 1.5 shows an example. Expand this list as you use more tools. 

TABLE 1.5 CAD tool filename extensions. 

Extension Description From To 

Viewlogic startup file, 

.ini library Viewlogic/Viewdraw Internal tools use 

search paths, etc. other Viewlogic tools 

.wir Schematic file 



1. PLCC = plastic leaded chip carrier, PQFP = plastic quad flat pack, CPGA =
ceramic pin-grid array, PPGA = plastic pin-grid array.

2. Maximum die size is not standard and varies between manufacturers. 

3. Minimum die size is an estimate based on bond length restrictions. 
CMOS LOGIC



A CMOS transistor (or device) has four terminals: gate, source, drain, and a
fourth terminal that we shall ignore until the next section. A CMOS transistor is a
switch. The switch must be conducting or on to allow current to flow between the
source and drain terminals (using open and closed for switches is confusing, for
the same reason we say a tap is on and not that it is closed). The transistor source
and drain terminals are equivalent as far as digital signals are concerned; we do
not worry about labeling an electrical switch with two terminals.

• $V_{AB}$ is the potential difference, or voltage, between nodes A and B in a
circuit; $V_{AB}$ is positive if node A is more positive than node B.

• Italics denote variables; constants are set in roman (upright) type.
Uppercase letters denote DC, large-signal, or steady-state voltages.

• For TTL the positive power supply is called VCC (or $V_{CC}$). The 'C'
denotes that the supply is connected indirectly to the collectors of the npn
bipolar transistors (a bipolar transistor has a collector, base, and emitter
corresponding roughly to the drain, gate, and source of an MOS
transistor).

• Following the example of TTL we used VDD (or $V_{DD}$) to denote
the positive supply in an NMOS chip where the devices are all n-channel
transistors and the drains of these devices are connected indirectly to the
positive supply. The supply nomenclature for NMOS chips has stuck for
CMOS.

• VDD is the name of the power supply node or net; $V_{DD}$ represents the
value (uppercase since $V_{DD}$ is a DC quantity). Since $V_{DD}$ is a variable, it
is italic (words and multiletter abbreviations use roman; thus it is $V_{DD}$, but
$V_{\mathrm{drain}}$).

• Logic designers often call the CMOS negative supply VSS or $V_{SS}$ even if
it is actually ground or GND. I shall use VSS for the node and $V_{SS}$ for the
value.

• CMOS uses positive logic: VDD is logic '1' and VSS is logic '0'.

We turn a transistor on or off using the gate terminal. There are two kinds of
CMOS transistors: n-channel transistors and p-channel transistors. An
n-channel transistor requires a logic '1' (from now on I'll just say a '1') on the gate
to make the switch conducting (to turn the transistor on). A p-channel transistor
requires a logic '0' (again from now on, I'll just say a '0') on the gate to make the
switch conducting (to turn the transistor on). The p-channel transistor
symbol has a bubble on its gate to remind us that the gate has to be a '0' to turn
the transistor on. All this is shown in Figure 2.1(a) and (b).



FIGURE 2.1 CMOS transistors as switches. (a) An n-channel transistor. (b) A
p-channel transistor. (c) A CMOS inverter and its symbol (an equilateral triangle
and a circle).



If we connect an n-channel transistor in series with a p-channel transistor, as
shown in Figure 2.1(c), we form an inverter. With four transistors we can form a
two-input NAND gate (Figure 2.2a). We can also make a two-input NOR gate
(Figure 2.2b). Logic designers normally use the terms NAND gate and logic gate
(or just gate), but I shall try to use the terms NAND cell and logic cell rather than
NAND gate or logic gate in this chapter to avoid any possible confusion with the
gate terminal of a transistor.
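
The switch view of these cells can be sketched in a few lines of Python. This
is only an illustration: the function names are made up for this sketch, and
the series and parallel arrangements used for the NAND and NOR cells are the
usual CMOS topologies, assumed here because the figure is not reproduced in
this text.

```python
# Switch-level sketch of CMOS logic cells. Logic values are the integers 0 and 1.


def nmos_conducts(gate: int) -> bool:
    """An n-channel transistor conducts when its gate is a '1'."""
    return gate == 1


def pmos_conducts(gate: int) -> bool:
    """A p-channel transistor conducts when its gate is a '0'."""
    return gate == 0


def resolve(pull_up: bool, pull_down: bool) -> int:
    """Return the output level; exactly one of the two networks must conduct."""
    if pull_up == pull_down:
        raise ValueError("output would be floating or shorted")
    return 1 if pull_up else 0


def inverter(a: int) -> int:
    # One p-channel switch to VDD, one n-channel switch to VSS.
    return resolve(pmos_conducts(a), nmos_conducts(a))


def nand2(a: int, b: int) -> int:
    # Assumed topology: p-channel switches in parallel, n-channel in series.
    pull_up = pmos_conducts(a) or pmos_conducts(b)
    pull_down = nmos_conducts(a) and nmos_conducts(b)
    return resolve(pull_up, pull_down)


def nor2(a: int, b: int) -> int:
    # Assumed topology: p-channel switches in series, n-channel in parallel.
    pull_up = pmos_conducts(a) and pmos_conducts(b)
    pull_down = nmos_conducts(a) or nmos_conducts(b)
    return resolve(pull_up, pull_down)


if __name__ == "__main__":
    print("A B NAND NOR")
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, nand2(a, b), "  ", nor2(a, b))
```

For every input combination exactly one of the two switch networks conducts,
which is why resolve never raises an error for these three cells.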
FIGURE 2.2 CMOS logic. (a) A two-input NAND logic cell. (b) A two-input
NOR logic cell. The n-channel and p-channel transistor switches implement the
'1's and '0's of a Karnaugh map.
2.1 CMOS Transistors



Figure 2.3 illustrates how electrons and holes abandon their dopant atoms leaving
a depletion region around a transistor's source and drain. The region between
source and drain is normally nonconducting. To make an n-channel transistor
conducting, we must apply a positive voltage $V_{GS}$ (the gate voltage with respect
to the source) that is greater than the n-channel transistor threshold voltage, $V_{tn}$
(a typical value is 0.5 V and, as far as we are presently concerned, is a constant).
This establishes a thin (≈ 50 Å) conducting channel of electrons under the gate.
MOS transistors can carry a very small current (the subthreshold current, a few
microamperes or less) with $V_{GS} < V_{tn}$, but we shall ignore this. A transistor
can be conducting ($V_{GS} > V_{tn}$) without any current flowing. To make current
flow in an n-channel transistor we must also apply a positive voltage, $V_{DS}$, to
the drain with respect to the source. Figure 2.3 shows these connections and the
connection to the fourth terminal of an MOS transistor: the bulk (well, tub, or
substrate) terminal. For an n-channel transistor we must connect the bulk to the
most negative potential, GND or VSS, to reverse bias the bulk-to-drain and
bulk-to-source pn-diodes. The arrow in the four-terminal n-channel transistor
symbol in Figure 2.3 reflects the polarity of these pn-diodes.




FIGURE 2.3 An n-channel MOS transistor. The gate-oxide thickness, $T_{ox}$, is
approximately 100 angstroms (0.01 µm). A typical transistor length, $L = 2\lambda$.
The bulk may be either the substrate or a well. The diodes represent
pn-junctions that must be reverse-biased.

The current flowing in the transistor is 

current (amperes) = charge (coulombs) per unit time (second). (2.1) 




We can express the current in terms of the total charge in the channel, $Q$ (imagine
taking a picture and counting the number of electrons in the channel at that
instant). If $t_f$ (for time of flight, sometimes called the transit time) is the time
that it takes an electron to cross between source and drain, the drain-to-source
current, $I_{DSn}$, is

$$I_{DSn} = Q / t_f . \tag{2.2}$$



We need to find $Q$ and $t_f$. The velocity of the electrons $\mathbf{v}$ (a vector) is given by
the equation that forms the basis of Ohm's law:

$$\mathbf{v} = -\mu_n \mathbf{E} , \tag{2.3}$$

where $\mu_n$ is the electron mobility ($\mu_p$ is the hole mobility) and $\mathbf{E}$ is the electric
field (with units $\mathrm{V\,m^{-1}}$).

Typical carrier mobility values are $\mu_n$ = 500 to 1000 $\mathrm{cm^2\,V^{-1}\,s^{-1}}$ and $\mu_p$ = 100 to
400 $\mathrm{cm^2\,V^{-1}\,s^{-1}}$. Equation 2.3 is a vector equation, but we shall ignore the
vertical electric field and concentrate on the horizontal electric field, $E_x$, that
moves the electrons between source and drain. The horizontal component of the
electric field is $E_x = -V_{DS}/L$, directed from the drain to the source, where $L$ is
the channel length (see Figure 2.3). The electrons travel a distance $L$ with
horizontal velocity $v_x = -\mu_n E_x$, so that

$$t_f = \frac{L}{v_x} = \frac{L^2}{\mu_n V_{DS}} . \tag{2.4}$$

Next we find the channel charge, $Q$. The channel and the gate form the plates of
a capacitor, separated by an insulator, the gate oxide. We know that the charge on
a linear capacitor, $C$, is $Q = CV$. Our lower plate, the channel, is not a linear
conductor. Charge only appears on the lower plate when the voltage between the
gate and the channel, $V_{GC}$, exceeds the n-channel threshold voltage. For our
nonlinear capacitor we need to modify the equation for a linear capacitor to the
following:

$$Q = C (V_{GC} - V_{tn}) . \tag{2.5}$$

The lower plate of our capacitor is resistive and conducting current, so that the
potential in the channel, $V_{GC}$, varies. In fact, $V_{GC} = V_{GS}$ at the source and
$V_{GC} = V_{GS} - V_{DS}$ at the drain. What we really should do is find an expression for
the channel charge as a function of channel voltage and sum (integrate) the
charge all the way across the channel, from $x = 0$ (at the source) to $x = L$ (at the
drain). Instead we shall assume that the channel voltage, $V_{GC}(x)$, is a linear
function of distance from the source and take the average value of the charge,
which is thus

$$Q = C [ (V_{GS} - V_{tn}) - 0.5 V_{DS} ] . \tag{2.6}$$



The gate capacitance, $C$, is given by the formula for a parallel-plate capacitor
with length $L$, width $W$, and plate separation equal to the gate-oxide thickness,
$T_{ox}$. Thus the gate capacitance is

$$C = \frac{W L \epsilon_{ox}}{T_{ox}} = W L C_{ox} , \tag{2.7}$$

where $\epsilon_{ox}$ is the gate-oxide dielectric permittivity. For silicon dioxide, SiO$_2$,
$\epsilon_{ox} \approx 3.45 \times 10^{-11}\ \mathrm{F\,m^{-1}}$, so that, for a typical gate-oxide thickness of 100 Å (1 Å = 1
angstrom = 0.1 nm), the gate capacitance per unit area, $C_{ox} \approx 3\ \mathrm{fF\,\mu m^{-2}}$.
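
As a quick numerical check of Eq. 2.7, the sketch below evaluates the gate
capacitance per unit area from the permittivity and oxide thickness quoted
above. The 6 µm by 0.6 µm gate used for the total capacitance is only an
illustrative choice (it matches the short-channel transistor drawn in
Figure 2.4), and the variable names are made up for this sketch.

```python
# Gate capacitance from the parallel-plate formula (Eq. 2.7), in SI units.

EPS_OX = 3.45e-11   # permittivity of SiO2 in F/m (value quoted in the text)
T_OX = 100e-10      # gate-oxide thickness: 100 angstroms expressed in meters


def capacitance_per_area(eps_ox: float, t_ox: float) -> float:
    """Return C_ox = eps_ox / T_ox in F/m^2."""
    return eps_ox / t_ox


def gate_capacitance(width: float, length: float, c_ox: float) -> float:
    """Return C = W * L * C_ox in farads (width and length in meters)."""
    return width * length * c_ox


c_ox = capacitance_per_area(EPS_OX, T_OX)

# 1 F/m^2 = 1e15 fF / 1e12 um^2 = 1e3 fF/um^2
c_ox_ff_per_um2 = c_ox * 1e3

# Illustrative gate: W = 6 um, L = 0.6 um (an assumption for this example).
c_gate_ff = gate_capacitance(6e-6, 0.6e-6, c_ox) * 1e15

if __name__ == "__main__":
    print(f"C_ox   = {c_ox_ff_per_um2:.2f} fF/um^2")
    print(f"C_gate = {c_gate_ff:.2f} fF for a 6 um x 0.6 um gate")
```

The unrounded result is 3.45 fF per square micron, which the text rounds to
about 3.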

Now we can express the channel charge in terms of the transistor parameters,

$$Q = W L C_{ox} [ (V_{GS} - V_{tn}) - 0.5 V_{DS} ] . \tag{2.8}$$

Finally, the drain-source current is

$$\begin{aligned}
I_{DSn} &= Q / t_f \\
&= (W/L)\, \mu_n C_{ox} [ (V_{GS} - V_{tn}) - 0.5 V_{DS} ] V_{DS} \\
&= (W/L)\, k'_n [ (V_{GS} - V_{tn}) - 0.5 V_{DS} ] V_{DS} .
\end{aligned} \tag{2.9}$$

The constant $k'_n$ is the process transconductance parameter (or intrinsic
transconductance):

$$k'_n = \mu_n C_{ox} . \tag{2.10}$$

We also define $\beta_n$, the transistor gain factor (or just gain factor) as

$$\beta_n = k'_n (W/L) . \tag{2.11}$$



The factor W/L (transistor width divided by length) is the transistor shape factor . 

Equation 2.9 describes the linear region (or triode region) of operation. This
equation is valid until $V_{DS} = V_{GS} - V_{tn}$ and then predicts that $I_{DS}$ decreases
with increasing $V_{DS}$, which does not make physical sense. At $V_{DS} = V_{GS} - V_{tn}
= V_{DS(sat)}$ (the saturation voltage) there is no longer enough voltage between
the gate and the drain end of the channel to support any channel charge. Clearly a
small amount of charge remains or the current would go to zero, but with very
little free charge the channel resistance in a small region close to the drain
increases rapidly and any further increase in $V_{DS}$ is dropped over this region.

Thus for $V_{DS} > V_{GS} - V_{tn}$ (the saturation region, or pentode region, of
operation) the drain current $I_{DS}$ remains approximately constant at the saturation
current, $I_{DSn(sat)}$, where

$$I_{DSn(sat)} = (\beta_n / 2)(V_{GS} - V_{tn})^2 ; \quad V_{GS} > V_{tn} . \tag{2.12}$$
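
Equations 2.9 and 2.12 together give a piecewise model for the n-channel
current, which can be sketched as follows. The function name ids_n is made up
for this sketch, the subthreshold current is ignored as in the text, and the
numbers in the example are the long-channel G5 values quoted in this section
($k'_n$ of about 9.05 × 10^-5 A V^-2, W/L = 60/6, threshold 0.65 V).

```python
def ids_n(v_gs: float, v_ds: float, beta_n: float, v_tn: float) -> float:
    """Long-channel n-channel drain-source current in amperes.

    Linear region (Eq. 2.9) for V_DS below V_GS - V_tn, saturation region
    (Eq. 2.12) otherwise. Subthreshold current is ignored, so the result is
    zero when V_GS does not exceed V_tn.
    """
    v_ov = v_gs - v_tn          # gate voltage in excess of the threshold
    if v_ov <= 0.0:
        return 0.0
    if v_ds < v_ov:
        return beta_n * (v_ov - 0.5 * v_ds) * v_ds   # Eq. 2.9
    return 0.5 * beta_n * v_ov ** 2                  # Eq. 2.12


if __name__ == "__main__":
    # beta_n = k'_n * (W/L) using the long-channel G5 values from the text.
    beta_n = 9.05e-5 * (60 / 6)
    for v_ds in (0.5, 1.0, 2.0, 3.0):
        i = ids_n(3.0, v_ds, beta_n, 0.65)
        print(f"V_DS = {v_ds:.1f} V  I_DS = {i * 1e3:.3f} mA")
```

The two expressions agree at the boundary between the regions, so the modeled
current is continuous there.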

Figure 2.4 shows the n-channel transistor $I_{DS}$-$V_{DS}$ characteristics for a generic
0.5 µm CMOS process that we shall call G5. We can fit Eq. 2.12 to the
long-channel transistor characteristics ($W$ = 60 µm, $L$ = 6 µm) in Figure 2.4(a).
If $I_{DSn(sat)}$ = 2.5 mA (with $V_{DS}$ = 3.0 V, $V_{GS}$ = 3.0 V, $V_{tn}$ = 0.65 V, $T_{ox}$
= 100 Å), the intrinsic transconductance is

$$k'_n = \frac{2 (L/W)\, I_{DSn(sat)}}{(V_{GS} - V_{tn})^2}
= \frac{2 (6/60) (2.5 \times 10^{-3})}{(3.0 - 0.65)^2}
= 9.05 \times 10^{-5}\ \mathrm{A\,V^{-2}} \tag{2.13}$$

or approximately 90 µA V$^{-2}$. This value of $k'_n$, calculated in the saturation
region, will be different (typically lower by a factor of 2 or more) from the value
of $k'_n$ measured in the linear region. We assumed the mobility, $\mu_n$, and the
threshold voltage, $V_{tn}$, are constants, neither of which is true, as we shall see in
Section 2.1.2.

For the p-channel transistor in the G5 process, $I_{DSp(sat)}$ = -850 µA ($V_{DS}$ =
-3.0 V, $V_{GS}$ = -3.0 V, $V_{tp}$ = -0.85 V, $W$ = 60 µm, $L$ = 6 µm). Then

$$k'_p = \frac{2 (L/W) (-I_{DSp(sat)})}{(V_{GS} - V_{tp})^2}
= \frac{2 (6/60) (850 \times 10^{-6})}{(-3.0 - (-0.85))^2}
= 3.68 \times 10^{-5}\ \mathrm{A\,V^{-2}} . \tag{2.14}$$


The next section explains the signs in Eq. 2.14. 
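
The two fits in Eqs. 2.13 and 2.14 can be reproduced with a few lines of
arithmetic. The sketch below takes the p-channel current and voltages as
negative quantities, following the sign convention described in Section 2.1.1;
the function and variable names are made up for this sketch.

```python
def k_prime_n(i_sat: float, v_gs: float, v_tn: float, w: float, l: float) -> float:
    """Intrinsic transconductance of an n-channel transistor (Eq. 2.13)."""
    return 2.0 * (l / w) * i_sat / (v_gs - v_tn) ** 2


def k_prime_p(i_sat: float, v_gs: float, v_tp: float, w: float, l: float) -> float:
    """Intrinsic transconductance of a p-channel transistor (Eq. 2.14).

    i_sat, v_gs, and v_tp are all negative for a p-channel transistor, so the
    current enters the formula with a minus sign to keep the result positive.
    """
    return 2.0 * (l / w) * (-i_sat) / (v_gs - v_tp) ** 2


# Long-channel G5 transistors from the text: W = 60 um, L = 6 um.
kn = k_prime_n(i_sat=2.5e-3, v_gs=3.0, v_tn=0.65, w=60.0, l=6.0)
kp = k_prime_p(i_sat=-850e-6, v_gs=-3.0, v_tp=-0.85, w=60.0, l=6.0)

if __name__ == "__main__":
    print(f"k'_n = {kn:.3e} A/V^2")
    print(f"k'_p = {kp:.3e} A/V^2")
```

Only the ratio W/L matters in these formulas, so the widths and lengths can be
passed in any consistent unit.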
FIGURE 2.4 MOS n-channel transistor characteristics for a generic 0.5 µm
process (G5). (a) A short-channel transistor, with W = 6 µm and L = 0.6 µm
(drawn), and a long-channel transistor (W = 60 µm, L = 6 µm). (b) The 6/0.6
characteristics represented as a surface. (c) A long-channel transistor obeys a
square-law characteristic between $I_{DS}$ and $V_{GS}$ in the saturation region
($V_{DS}$ = 3 V). A short-channel transistor shows a more linear characteristic due
to velocity saturation. Normally, all of the transistors used on an ASIC have short
channels.







2.1.1 P-Channel Transistors 



The source and drain of CMOS transistors look identical; we have to know which
way the current is flowing to distinguish them. The source of an n-channel
transistor is lower in potential than the drain and vice versa for a p-channel
transistor. In an n-channel transistor the threshold voltage, $V_{tn}$, is normally
positive, and the terminal voltages $V_{DS}$ and $V_{GS}$ are also usually positive. In a
p-channel transistor $V_{tp}$ is normally negative and we have a choice: We can write
everything in terms of the magnitudes of the voltages and currents or we can use
negative signs in a consistent fashion.

Here are the equations for a p-channel transistor using negative signs:

$$\begin{aligned}
I_{DSp} &= -k'_p (W/L) [ (V_{GS} - V_{tp}) - 0.5 V_{DS} ] V_{DS} ; & V_{DS} &> V_{GS} - V_{tp} \\
I_{DSp(sat)} &= -(\beta_p / 2)(V_{GS} - V_{tp})^2 ; & V_{DS} &< V_{GS} - V_{tp} .
\end{aligned} \tag{2.15}$$

In these two equations $V_{tp}$ is negative, and the terminal voltages $V_{DS}$ and $V_{GS}$
are also normally negative (and -3 V < -2 V, for example). The current $I_{DSp}$ is
then negative, corresponding to conventional current flowing from source to
drain of a p-channel transistor (and hence the negative sign for $I_{DSp(sat)}$ in
Eq. 2.14).
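
The sign convention is easy to get wrong, so it may help to see the p-channel
equations written out as code. This is a sketch under the convention just
described (negative threshold, negative terminal voltages, negative current);
the function name ids_p is made up here, and the gain factor is taken as the
p-channel intrinsic transconductance times W/L, by analogy with Eq. 2.11.

```python
def ids_p(v_gs: float, v_ds: float, beta_p: float, v_tp: float) -> float:
    """Long-channel p-channel drain-source current in amperes (Eq. 2.15).

    All voltages use the signed convention: v_tp, v_gs, and v_ds are normally
    negative, and the returned current is negative (or zero when off).
    """
    v_ov = v_gs - v_tp          # negative when the transistor is on
    if v_ov >= 0.0:
        return 0.0              # off; subthreshold current ignored
    if v_ds > v_ov:
        # Linear region: V_DS is closer to zero than V_GS - V_tp.
        return -beta_p * (v_ov - 0.5 * v_ds) * v_ds
    # Saturation region.
    return -0.5 * beta_p * v_ov ** 2


if __name__ == "__main__":
    # G5 long-channel p-channel values from the text: k'_p = 3.68e-5 A/V^2,
    # W/L = 60/6, V_tp = -0.85 V.
    beta_p = 3.68e-5 * (60 / 6)
    for v_ds in (-0.5, -1.0, -2.0, -3.0):
        i = ids_p(-3.0, v_ds, beta_p, -0.85)
        print(f"V_DS = {v_ds:.1f} V  I_DSp = {i * 1e6:.1f} uA")
```

With both voltages at -3.0 V the model returns about -850 µA, in line with the
value used for the fit in Eq. 2.14.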

2.1.2 Velocity Saturation 

For a deep submicron transistor, Eq. 2.12 may overestimate the drain-source
current by a factor of 2 or more. There are three reasons for this error. First, the
threshold voltage is not constant. Second, the actual length of the channel (the
electrical or effective length, often written as $L_{eff}$) is less than the drawn (mask)
length. The third reason is that Eq. 2.3 is not valid for high electric fields. The
electrons cannot move any faster than about $v_{max\,n} = 10^5\ \mathrm{m\,s^{-1}}$ when the electric
field is above $10^6\ \mathrm{V\,m^{-1}}$ (reached when 1 V is dropped across 1 µm); the
electrons become velocity saturated. In this case $t_f = L_{eff} / v_{max\,n}$, the
drain-source saturation current is independent of the transistor length, and Eq. 2.12
becomes

$$I_{DSn(sat)} = W v_{max\,n} C_{ox} (V_{GS} - V_{tn}) ; \quad V_{DS} > V_{DS(sat)} \ \text{(velocity saturated)}. \tag{2.16}$$

We can see this behavior for the short-channel transistor characteristics in
Figure 2.4(a) and (c).

Transistor current is often specified per micron of gate width because of the form
of Eq. 2.16. As an example, suppose $I_{DSn(sat)} / W$ = 300 µA µm$^{-1}$ for the
n-channel transistors in our G5 process (with $V_{DS}$ = 3.0 V, $V_{GS}$ = 3.0 V, $V_{tn}$ =
0.65 V, $L_{eff}$ = 0.5 µm and $T_{ox}$ = 100 Å). Then $E_x \approx -(3 - 0.65)\ \mathrm{V} / 0.5\ \mu\mathrm{m} \approx -5\ \mathrm{V\,\mu m^{-1}}$,

$$v_{max\,n} = \frac{I_{DSn(sat)} / W}{C_{ox} (V_{GS} - V_{tn})}
= \frac{(300 \times 10^{-6}) (1 \times 10^{6})}{(3.45 \times 10^{-3}) (3 - 0.65)}
= 37{,}000\ \mathrm{m\,s^{-1}} \tag{2.17}$$

and $t_f \approx 0.5\ \mu\mathrm{m} / 37{,}000\ \mathrm{m\,s^{-1}} \approx 13$ ps.

The value for $v_{max\,n}$ is lower than the $10^5\ \mathrm{m\,s^{-1}}$ we expected because the carrier
velocity is also lowered by mobility degradation due to the vertical electric field,
which we have ignored. This vertical field forces the carriers to keep bumping
into the interface between the silicon and the gate oxide, slowing them down.
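
The arithmetic in Eq. 2.17 can be checked with the short sketch below, which
works in SI units throughout. The variable names are made up for this sketch,
and the gate capacitance per unit area is the unrounded value that follows from
a 100 Å oxide (3.45 × 10^-3 F m^-2).

```python
# Saturation velocity and time of flight from the velocity-saturated model.

I_PER_WIDTH = 300e-6 / 1e-6   # 300 uA per um of gate width, expressed in A/m
C_OX = 3.45e-3                # gate capacitance per unit area in F/m^2
V_GS = 3.0                    # gate-source voltage in volts
V_TN = 0.65                   # n-channel threshold voltage in volts
L_EFF = 0.5e-6                # effective channel length in meters

# Eq. 2.16 rearranged for the saturation velocity (Eq. 2.17).
v_max_n = I_PER_WIDTH / (C_OX * (V_GS - V_TN))

# Time of flight across the effective channel length.
t_f = L_EFF / v_max_n

if __name__ == "__main__":
    print(f"v_max_n = {v_max_n:,.0f} m/s")
    print(f"t_f     = {t_f * 1e12:.1f} ps")
```

The time of flight comes out at about 13.5 ps, which the text rounds to 13 ps.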

2.1.3 SPICE Models 

The simulation program SPICE (which stands for Simulation Program with
Integrated Circuit Emphasis) is often used to characterize logic cells. Table 2.1
shows a typical set of model parameters for our G5 process. The SPICE
parameter KP (given in µA V$^{-2}$) corresponds to $k'_n$ (and $k'_p$). SPICE
parameters VTO and TOX correspond to $V_{tn}$ (and $V_{tp}$), and $T_{ox}$. SPICE
parameter UO (given in $\mathrm{cm^2\,V^{-1}\,s^{-1}}$) corresponds to the ideal bulk mobility
values, $\mu_n$ (and $\mu_p$). Many of the other parameters model velocity saturation
and mobility degradation (and thus the effective value of $k'_n$ and $k'_p$).

TABLE 2.1 SPICE parameters for a generic 0.5 μm process, G5 (0.6 μm
drawn gate length). The n-channel transistor characteristics are shown in
Figure 2.4.

.MODEL CMOSN NMOS LEVEL=3 PHI=0.7 TOX=10E-09 XJ=0.2U TPG=1
+ VTO=0.65 DELTA=0.7
+ LD=5E-08 KP=2E-04 UO=550 THETA=0.27 RSH=2 GAMMA=0.6
+ NSUB=1.4E+17 NFS=6E+11
+ VMAX=2E+05 ETA=3.7E-02 KAPPA=2.9E-02 CGDO=3.0E-10
+ CGSO=3.0E-10 CGBO=4.0E-10
+ CJ=5.6E-04 MJ=0.56 CJSW=5E-11 MJSW=0.52 PB=1

.MODEL CMOSP PMOS LEVEL=3 PHI=0.7 TOX=10E-09 XJ=0.2U TPG=-1
+ VTO=-0.92 DELTA=0.29
+ LD=3.5E-08 KP=4.9E-05 UO=135 THETA=0.18 RSH=2 GAMMA=0.47
+ NSUB=8.5E+16 NFS=6.5E+11
+ VMAX=2.5E+05 ETA=2.45E-02 KAPPA=7.96 CGDO=2.4E-10
+ CGSO=2.4E-10 CGBO=3.8E-10
+ CJ=9.3E-04 MJ=0.47 CJSW=2.9E-10 MJSW=0.505 PB=1
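
As a rough consistency check on Table 2.1, the sketch below compares the listed
KP values with the product of the bulk mobility (UO) and the oxide capacitance
implied by TOX. It assumes the usual definition $k' = \mu C_{ox}$ and an oxide
permittivity of 3.45 × 10⁻¹¹ F/m; the function name is hypothetical.

```python
EPS_OX = 3.45e-11  # assumed permittivity of SiO2, F/m

def ideal_process_transconductance(uo_cm2_per_vs, tox_m):
    """Return mu * Cox in A/V^2 from SPICE-style UO (cm^2/Vs) and TOX (m)."""
    mobility = uo_cm2_per_vs * 1e-4   # cm^2/Vs -> m^2/Vs
    c_ox = EPS_OX / tox_m             # F/m^2
    return mobility * c_ox

# Values copied from the .MODEL statements in Table 2.1.
kn_ideal = ideal_process_transconductance(550, 10e-9)
kp_ideal = ideal_process_transconductance(135, 10e-9)

print(f"n-channel: mu*Cox = {kn_ideal:.3e} A/V^2, KP = 2.0e-04 A/V^2")
print(f"p-channel: mu*Cox = {kp_ideal:.3e} A/V^2, KP = 4.9e-05 A/V^2")
```

Under these assumptions both products land within roughly 5 percent of the
tabulated KP values.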

2.1.4 Logic Levels 

Figure 2.5 shows how to use transistors as logic switches. The bulk connection
for the n-channel transistor in Figure 2.5(a) and (b) is a p-well. The bulk
connection for the p-channel transistor is an n-well. The remaining connections
show what happens when we try to pass a logic signal between the drain and
source terminals.






FIGURE 2.5 CMOS logic levels. (a) A strong '0'. (b) A weak '1'. (c) A weak '0'.
(d) A strong '1'. ($V_{tn}$ is positive and $V_{tp}$ is negative.) The depth of
the channels is greatly exaggerated.

In Figure 2.5(a) we apply a logic '1' (or $V_{DD}$; I shall use these
interchangeably) to the gate and a logic '0' ($V_{SS}$) to the source (we know it
is the source since electrons must flow from this point, since $V_{SS}$ is the
lowest voltage on the chip). The application of these voltages makes the
n-channel transistor conduct current, and electrons flow from source to drain.

Suppose the drain is initially at logic '1'; then the n-channel transistor will
begin to discharge any capacitance that is connected to its drain (due to another
logic cell, for example). This will continue until the drain terminal reaches a
logic '0', and at that time $V_{GD}$ and $V_{GS}$ are both equal to $V_{DD}$, a
full logic '1'. The transistor is strongly conducting now (with a large channel
charge, Q), but there is no current flowing since $V_{DS}$ = 0 V. The transistor
will strongly object to attempts to change its drain terminal from a logic '0'.
We say that the logic level at the drain is a strong '0'.

In Figure 2.5(b) we apply a logic '1' to the drain (it must now be the drain
since electrons have to flow toward a logic '1'). The situation is now quite
different: the transistor is still on but $V_{GS}$ is decreasing as the source
voltage approaches its final value. In fact, the source terminal never gets to a
logic '1'; the source will stop increasing in voltage when $V_{GS}$ reaches
$V_{tn}$. At this point the transistor is very nearly off and the source voltage
creeps slowly up to $V_{DD} - V_{tn}$. Because the transistor is very nearly off,
it would be easy for a logic cell connected to the source to change the potential
there, since there is so little channel charge. The logic level at the source is
a weak '1'. Figure 2.5(c) and (d) show that the state of affairs for a p-channel
transistor is the exact reverse or complement of the n-channel transistor
situation.

In summary, we have the following logic levels: 

• An n-channel transistor provides a strong '0', but a weak '1'.

• A p-channel transistor provides a strong '1', but a weak '0'.

Sometimes we refer to the weak versions of '0' and '1' as degraded logic levels.
In CMOS technology we can use both types of transistor together to produce
strong '0' logic levels as well as strong '1' logic levels.
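
The strong and weak levels can be illustrated numerically. The sketch below is
a first-order model only: it takes the G5 thresholds from Table 2.1, assumes a
3.0 V supply, and ignores the body effect and the slow subthreshold creep
mentioned above. The function names are hypothetical.

```python
V_DD, V_SS = 3.0, 0.0        # assumed supply rails, V
V_TN, V_TP = 0.65, -0.92     # thresholds from Table 2.1 (VTO), V

def n_channel_pass(v_in):
    """Level reached at the far side of an n-channel switch with gate at V_DD."""
    return min(v_in, V_DD - V_TN)

def p_channel_pass(v_in):
    """Level reached at the far side of a p-channel switch with gate at V_SS."""
    return max(v_in, V_SS - V_TP)

for name, v in (("logic '0'", V_SS), ("logic '1'", V_DD)):
    print(f"{name}: n-channel passes {n_channel_pass(v):.2f} V, "
          f"p-channel passes {p_channel_pass(v):.2f} V")
```

With these numbers the n-channel switch passes 0 V unchanged but tops out at
2.35 V, while the p-channel switch passes 3.0 V unchanged but cannot pull below
0.92 V.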




2.2 The CMOS Process 



Figure 2.6 outlines the steps to create an integrated circuit. The starting
material is silicon, Si, refined from quartzite (with less than 1 impurity in
10¹⁰ silicon atoms). We draw a single-crystal silicon boule (or ingot) from a
crucible containing a melt at approximately 1500 °C (the melting point of silicon
at 1 atm pressure is 1414 °C). This method is known as Czochralski growth.
Acceptor (p-type) or donor (n-type) dopants may be introduced into the melt to
alter the type of silicon grown.

The boule is sawn to form thin circular wafers (6, 8, or 12 inches in diameter,
and typically 600 μm thick), and a flat is ground (the primary flat),
perpendicular to the <110> crystal axis, as a "this edge down" indication. The
boule is drawn so that the wafer surface is either in the (111) or (100) crystal
planes. A smaller secondary flat indicates the wafer crystalline orientation and
doping type. A typical submicron CMOS process uses p-type (100) wafers with a
resistivity of approximately 10 Ω cm; this type of wafer has two flats, 90°
apart. Wafers are made by chemical companies and sold to the IC manufacturers. A
blank 8-inch wafer costs about $100.

To begin IC fabrication we place a batch of wafers (a wafer lot) on a boat and
grow a layer (typically a few thousand angstroms) of silicon dioxide, SiO₂,
using a furnace. Silicon is used in the semiconductor industry not so much for
the properties of silicon, but because of the physical, chemical, and electrical
properties of its native oxide, SiO₂. An IC fabrication process contains a series
of masking steps (that in turn contain other steps) to create the layers that
define the transistors and metal interconnect.





FIGURE 2.6 IC fabrication. Grow crystalline silicon (1); make a wafer (2-3);
grow a silicon dioxide (oxide) layer in a furnace (4); apply liquid photoresist
(resist) (5); mask exposure (6); a cross-section through a wafer showing the
developed resist (7); etch the oxide layer (8); ion implantation (9-10); strip
the resist (11); strip the oxide (12). Steps similar to 4-12 are repeated for
each layer (typically 12 to 20 times for a CMOS process).

Each masking step starts by spinning a thin layer (approximately 1 μm) of liquid
photoresist (resist) onto each wafer. The wafers are baked at about 100 °C to
remove the solvent and harden the resist before being exposed to ultraviolet (UV)
light (typically less than 200 nm wavelength) through a mask. The UV light
alters the structure of the resist, allowing it to be removed by developing. The
exposed oxide may then be etched (removed). Dry plasma etching etches in the
vertical direction much faster than it does horizontally (an anisotropic etch).
Wet etch techniques are usually isotropic. The resist functions as a mask during
the etch step and transfers the desired pattern to the oxide layer.

Dopant ions are then introduced into the exposed silicon areas. Figure 2.6
illustrates the use of ion implantation. An ion implanter is a cross between a TV
and a mass spectrometer and fires dopant ions into the silicon wafer. Ions can
only penetrate materials to a depth (the range, normally a few microns) that
depends on the closely controlled implant energy (measured in keV, usually
between 10 and 100 keV; an electron volt, 1 eV, is 1.6 × 10⁻¹⁹ J). By using
layers of resist, oxide, and polysilicon we can prevent dopant ions from reaching
the silicon surface and thus block the silicon from receiving an implant. We
control the doping level by counting the number of ions we implant (by
integrating the ion-beam current). The implant dose is measured in atoms/cm²
(typical doses are from 10¹³ to 10¹⁵ cm⁻²). As an alternative to ion implantation
we may instead strip the resist and introduce dopants by diffusion from a gaseous
source in a furnace.
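
Counting ions by integrating the beam current amounts to dividing the collected
charge by the charge per ion and by the implanted area. The sketch below shows
that bookkeeping under stated assumptions: a constant beam current, singly
charged ions, a beam scanned uniformly over a nominal 200 mm wafer, and a
hypothetical 1 mA beam. None of these operating values come from the text.

```python
import math

Q_E = 1.6e-19  # charge of a singly ionized dopant ion, C (value used in the text)

def implant_dose(beam_current_a, time_s, area_cm2, charge_state=1):
    """Dose in ions/cm^2 for a constant beam current integrated over time."""
    ions = beam_current_a * time_s / (charge_state * Q_E)
    return ions / area_cm2

def implant_time(dose_per_cm2, beam_current_a, area_cm2, charge_state=1):
    """Time in seconds needed to reach a target dose at a constant current."""
    return dose_per_cm2 * area_cm2 * charge_state * Q_E / beam_current_a

wafer_area = math.pi * (20.0 / 2) ** 2        # nominal 200 mm wafer, in cm^2
dose = implant_dose(1e-3, 50.0, wafer_area)   # hypothetical 1 mA beam for 50 s

print(f"wafer area = {wafer_area:.0f} cm^2")
print(f"dose       = {dose:.2e} ions/cm^2")
print(f"time for 1e13 cm^-2 = {implant_time(1e13, 1e-3, wafer_area):.2f} s")
```

In this example 50 s of a 1 mA beam gives a dose close to 10¹⁵ cm⁻², the upper
end of the typical range quoted above.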

Once we have completed the transistor diffusion layers we can deposit layers of
other materials. Layers of polycrystalline silicon (polysilicon or poly), SiO₂,
and silicon nitride (Si₃N₄), for example, may be deposited using chemical vapor
deposition (CVD). Metal layers can be deposited using sputtering. All these
layers are patterned using masks and similar photolithography steps to those
shown in Figure 2.6.



TABLE 2.2 CMOS process layers.

Mask/layer name            Derivation from    Alternative names for                         MOSIS mask
                           drawn layers       mask/layer                                    label
n-well                     = nwell [1]        bulk, substrate, tub, n-tub, moat             CWN
p-well                     = pwell [1]        bulk, substrate, tub, p-tub, moat             CWP
active                     = pdiff + ndiff    thin oxide, thinox, island, gate oxide        CAA
polysilicon                = poly             poly, gate                                    CPG
n-diffusion implant [2]    = grow (ndiff)     ndiff, n-select, nplus, n+                    CSN
p-diffusion implant [2]    = grow (pdiff)     pdiff, p-select, pplus, p+                    CSP
contact                    = contact          contact cut, poly contact, diffusion contact  CCP and CCA [3]
metal1                     = m1               first-level metal                             CMF
metal2                     = m2               second-level metal                            CMS
via2                       = via2             metal2/metal3 via, m2/m3 via                  CVS
metal3                     = m3               third-level metal                             CMT
glass                      = glass            passivation, overglass, pad                   COG




Table 2.2 shows the mask layers (and their relation to the drawn layers) for a
submicron, silicon-gate, three-level metal, self-aligned, CMOS process. A
process in which the effective gate length is less than 1 μm is referred to as a
submicron process. Gate lengths below 0.35 μm are considered in the
deep-submicron regime.

Figure 2.7 shows the layers that we draw to define the masks for the logic cell of
Figure 1.3. Potential confusion arises because we like to keep layout simple but
maintain a "what you see is what you get" (WYSIWYG) approach. This means
that the drawn layers do not correspond directly to the masks in all cases.



[Figure 2.7 panels: (a) nwell, (b) pwell, (c) ndiff, (d) pdiff, (e) poly,
(f) contact, (g) m1, (h) via, (i) m2, (j) cell, (k) phantom.]

FIGURE 2.7 The standard cell shown in Figure 1.3. (a) to (i) The drawn layers
that define the masks. The active mask is the union of the ndiff and pdiff drawn
layers. The n-diffusion implant and p-diffusion implant masks are bloated
versions of the ndiff and pdiff drawn layers. (j) The complete cell layout.
(k) The phantom cell layout. Often an ASIC vendor hides the details of the
internal cell construction. The phantom cell is used for layout by the customer
and then instantiated by the ASIC vendor after layout is complete. This layout
uses grayscale stipple patterns to distinguish between layers.

We can construct wells in a CMOS process in several ways. In an n-well process,
the substrate is p-type (the wafer itself) and we use an n-well mask to build the
n-well. We do not need a p-well mask because there are no p-wells in an n-well
process (the n-channel transistors all sit in the substrate, the wafer), but we
often draw the p-well layer as though it existed. In a p-well process we use a
p-well mask to make the p-wells and the n-wells are the substrate. In a twin-tub
(or twin-well) process, we create individual wells for both types of transistors,
and neither well is the substrate (which may be either n-type or p-type). There
are even triple-well processes used to achieve even more control over the
transistor performance. Whatever process we use, we must connect all the n-wells
to the most positive potential on the chip, normally VDD, and all the p-wells to
VSS; otherwise we may forward bias the bulk to source/drain pn-junctions. The
bulk connections for CMOS transistors are not usually drawn in digital circuit
schematics, but these substrate contacts (well contacts or tub ties) are very
important. After we make the well(s), we grow a layer (approximately 1500 Å) of
Si₃N₄ over the wafer. The active mask (CAA) leaves this nitride layer only in the
active areas that will later become transistors or substrate contacts. Thus

CAA (mask) = ndiff (drawn) ∨ pdiff (drawn) , (2.18)

where the ∨ symbol represents OR (union) of the two drawn layers, ndiff and
pdiff. Everything outside the active areas is known as the field region, or just
field.

Next we implant the substrate to prevent unwanted transistors from forming in
the field region; this is the field implant or channel-stop implant. The nitride
over the active areas acts as an implant mask and we may use another
field-implant mask at this step also. Following this we grow a thick
(approximately 5000 Å) layer of SiO₂, the field oxide (FOX). The FOX will not
grow over the nitride areas. When we strip the nitride we are left with FOX in
the areas where we do not want to dope the silicon. Following this we deposit,
dope, mask, and etch the poly gate material, CPG (mask) = poly (drawn). Next we
create the doped regions that form the sources, drains, and substrate contacts
using ion implantation. The poly gate functions like masking tape in these steps.
One implant (using phosphorus or arsenic ions) forms the n-type source/drain for
the n-channel transistors and n-type substrate contacts (CSN). A second implant
(using boron ions) forms the p-type source/drain for the p-channel transistors
and p-type substrate contacts (CSP). These implants are masked as follows:

CSN (mask) = grow (ndiff (drawn)) , (2.19)

CSP (mask) = grow (pdiff (drawn)) , (2.20)

where grow means that we expand or bloat the drawn ndiff and drawn pdiff
layers slightly (usually by a few λ).

During implantation the dopant ions are blocked by the resist pattern defined by
the CSN and CSP masks. The CSN mask thus prevents the n-type regions from being
implanted with p-type dopants (and vice versa for the CSP mask). As we shall
see, the CSN and CSP masks are not intended to define the edges of the n-type
and p-type regions. Instead these two masks function more like newspaper that
prevents paint from spraying everywhere. The dopant ions are also blocked from
reaching the silicon surface by the poly gates and this aligns the edge of the
source and drain regions to the edges of the gates (we call this a self-aligned
process). In addition, the implants are blocked by the FOX and this defines the
outside edges of the source, drain, and substrate contact regions.




The only areas of the silicon surface that are doped n-type are

n-diffusion (silicon) = (CAA (mask) ∧ CSN (mask)) ∧ (¬ CPG (mask)) ; (2.21)

where the ∧ symbol represents AND (the intersection of two layers); and the ¬
symbol represents NOT.

Similarly, the only regions that are doped p-type are

p-diffusion (silicon) = (CAA (mask) ∧ CSP (mask)) ∧ (¬ CPG (mask)) . (2.22)

If the CSN and CSP masks do not overlap, it is possible to save a mask by using 
one implant mask (CSN or CSP) for the other type (CSP or CSN). We can do this 
by using a positive resist (the pattern of resist remaining after developing is the 
same as the dark areas on the mask) for one implant step and a negative resist 
(vice versa) for the other step. However, because of the poor resolution of 
negative resist and because of difficulties in generating the implant masks 
automatically from the drawn diffusions (especially when opposite diffusion 
types are drawn close to each other or touching), it is now common to draw both 
implant masks as well as the two diffusion layers. 

It is important to remember that, even though poly is above diffusion, the
polysilicon is deposited first and acts like masking tape. It is rather like
airbrushing a stripe: you use masking tape and spray everywhere without
worrying about making straight lines. The edges of the pattern will align to the
edge of the tape. Here the analogy ends because the poly is left in place. Thus,

n-diffusion (silicon) = (ndiff (drawn)) ∧ (¬ poly (drawn)) and (2.23)

p-diffusion (silicon) = (pdiff (drawn)) ∧ (¬ poly (drawn)) . (2.24)
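
The layer algebra of Eqs. 2.18 to 2.24 maps directly onto set operations. The
sketch below is illustrative only: layers are modeled as sets of unit grid
cells, the rectangle coordinates are made up, and grow is implemented as a
simple square dilation. It checks that the mask expressions (2.21) and (2.22)
reduce to the drawn-layer expressions (2.23) and (2.24) when the two diffusions
are drawn well apart.

```python
def rect(x0, y0, x1, y1):
    """Set of unit grid cells covering x0 <= x < x1 and y0 <= y < y1."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}

def grow(layer, n=1):
    """Bloat a layer by n cells in every direction (a square dilation)."""
    return {(x + dx, y + dy) for (x, y) in layer
            for dx in range(-n, n + 1) for dy in range(-n, n + 1)}

# Hypothetical drawn layers: one poly stripe crossing two diffusion rectangles.
ndiff = rect(2, 2, 12, 6)
pdiff = rect(2, 12, 12, 18)
poly = rect(6, 0, 8, 20)

caa = ndiff | pdiff      # Eq. 2.18: active mask is the union (OR)
cpg = poly               # poly mask equals drawn poly
csn = grow(ndiff, 2)     # Eq. 2.19
csp = grow(pdiff, 2)     # Eq. 2.20

n_diffusion = (caa & csn) - cpg   # Eq. 2.21: AND, then AND NOT poly
p_diffusion = (caa & csp) - cpg   # Eq. 2.22

print(len(n_diffusion), "n-type cells;", len(p_diffusion), "p-type cells")
print("matches Eq. 2.23:", n_diffusion == ndiff - poly)
print("matches Eq. 2.24:", p_diffusion == pdiff - poly)
```

If the two diffusions were drawn closer together than the grow distance, the
grown implant masks would overlap the opposite diffusion and the equivalence
would no longer hold, which is one reason the implant masks are often drawn
explicitly.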

In the ASIC industry the names nplus, n+, and n-diffusion (as well as the p-type
equivalents) are used in various ways. These names may refer to either the drawn
diffusion layer (that we call ndiff), the mask (CSN), or the doped region on the
silicon (the intersection of the active and implant mask that we call
n-diffusion), which is very confusing.

The source and drain are often formed from two separate implants. The first is a 
light implant close to the edge of the gate, the second a heavier implant that 
forms the rest of the source or drain region. The separate diffusions reduce the 
electric field near the drain end of the channel. Tailoring the device 
characteristics in this fashion is known as drain engineering and a process 
including these steps is referred to as an LDD process , for lightly doped drain ; 
the first light implant is known as an LDD diffusion or LDD implant. 




FIGURE 2.8 Drawn layers and 
an example set of 
black-and-white stipple patterns 
for a CMOS process. On top are 
the patterns as they appear in 
layout. Underneath are the 
magnified 8-by-8 pixel patterns. 

If we are trying to simplify 
layout we may use solid black or 
white for contact and vias. If we 
have contacts and vias placed on 
top of one another we may use 
stipple patterns or other means 
to help distinguish between 
them. Each stipple pattern is 
transparent, so that black shows 
through from underneath when 
layers are superimposed. There 
are no standards for these 
patterns. 

Figure 2.8 shows a stipple-pattern matrix for a CMOS process. When we draw
layout we can see through the layers: all the stipple patterns are ORed together.
Figure 2.9 shows the transistor layers as they appear in layout (drawn using the
patterns from Figure 2.8) and as they appear on the silicon. Figure 2.10 shows the
same thing for the interconnect layers.




FIGURE 2.9 The transistor layers. (a) A p-channel transistor as drawn in layout.
(b) The corresponding silicon cross section (the heavy lines in part a show the
cuts). This is how a p-channel transistor would look just after completing the
source and drain implant steps.

FIGURE 2.10 The interconnect layers. (a) Metal layers as drawn in layout.
(b) The corresponding structure (as it might appear in a scanning-electron
micrograph). The insulating layers between the metal layers are not shown.
Contact is made to the underlying silicon through a platinum barrier layer. Each
via consists of a tungsten plug. Each metal layer consists of a
titanium-tungsten and aluminum-copper sandwich. Most deep submicron CMOS
processes use metal structures similar to this. The scale, rounding, and
irregularity of the features are realistic.




2.2.1 Sheet Resistance 



Tables 2.3 and 2.4 show the sheet resistance for each conducting layer (in 
decreasing order of resistance) for two different generations of CMOS process. 



TABLE 2.3 Sheet resistance (1 μm CMOS).

Layer          Sheet resistance    Units
n-well         1.15 ± 0.25         kΩ/square
poly           3.5 ± 2.0           Ω/square
n-diffusion    75 ± 20             Ω/square
p-diffusion    140 ± 40            Ω/square
m1/2           70 ± 6              mΩ/square
m3             30 ± 3              mΩ/square

TABLE 2.4 Sheet resistance (0.35 μm CMOS).

Layer          Sheet resistance    Units
n-well         1 ± 0.4             kΩ/square
poly           10 ± 4.0            Ω/square
n-diffusion    3.5 ± 2.0           Ω/square
p-diffusion    2.5 ± 1.5           Ω/square
m1/2/3         60 ± 6              mΩ/square
metal4         30 ± 3              mΩ/square




The diffusion layers, n-diffusion and p-diffusion, both have a high resistivity,
typically from 1 to 100 Ω/square. We measure resistance in Ω/square (ohms per
square) because for a fixed thickness of material it does not matter what the
size of a square is: the resistance is the same. Thus the resistance of a
rectangular shape of a sheet of material may be calculated from the number of
squares it contains times the sheet resistance in Ω/square. We can use diffusion
for very short connections inside a logic cell, but not for interconnect between
logic cells. Poly has the next highest resistance to diffusion. Most submicron
CMOS processes use a silicide material (a metallic compound of silicon) that has
much lower resistivity (at several Ω/square) than the poly or diffusion layers
alone. Examples are tantalum silicide, TaSi; tungsten silicide, WSi; or titanium
silicide, TiSi. The stoichiometry of these deposited silicides varies. For
example, for tungsten silicide W:Si ≈ 1:2.6.
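
Counting squares is simple enough to express in a few lines. In the sketch
below the sheet-resistance values are the nominal figures from Table 2.3; the
wire dimensions and the function name are hypothetical.

```python
def wire_resistance(length_um, width_um, sheet_ohms_per_square):
    """Resistance of a straight wire: number of squares times sheet resistance."""
    squares = length_um / width_um
    return squares * sheet_ohms_per_square

# A square has the same resistance whatever its size.
small_square = wire_resistance(2, 2, 75)     # 2 um x 2 um of n-diffusion
large_square = wire_resistance(10, 10, 75)   # 10 um x 10 um of n-diffusion
print(small_square, large_square)

# Hypothetical 100 um x 1 um wires in poly and in first-level metal.
print(f"poly:   {wire_resistance(100, 1, 3.5):.0f} ohms")
print(f"metal1: {wire_resistance(100, 1, 0.070):.0f} ohms")
```

The same 100-square wire is 350 Ω in poly but only 7 Ω in metal1, which is why
diffusion and poly are kept to short connections inside a cell.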

There are two types of silicide process. In a silicide process only the gate is
silicided. This reduces the poly sheet resistance, but not that of the
source/drain.

In a self-aligned silicide (salicide) process, both the gate and the
source/drain regions are silicided. In some processes silicide can be used to
connect adjacent poly and diffusion (we call this feature LI, white metal, local
interconnect, metal0, or m0). LI is useful to reduce the area of ASIC RAM cells,
for example.

Interconnect uses metal layers with resistivities of tens of mΩ/square, several
orders of magnitude less than the other layers. There are usually several layers
of metal in a CMOS ASIC process, each separated by an insulating layer. The metal
layer above the poly gate layer is the first-level metal (m1 or metal1), the next
is the second-level metal (m2 or metal2), and so on. We can make connections
from m1 to diffusion using diffusion contacts or to the poly using polysilicon
contacts.

After we etch the contact holes a thin barrier metal (typically platinum) is
deposited over the silicon and poly. Next we form contact plugs (via plugs for
connections between metal layers) to reduce contact resistance and the likelihood
of breaks in the contacts. Tungsten is commonly used for these plugs. Following
this we form the metal layers as sandwiches. The middle of the sandwich is a
layer (usually from 3000 Å to 10,000 Å) of aluminum and copper. The top and
bottom layers are normally titanium-tungsten (TiW, pronounced "tie-tungsten").
Submicron processes use chemical-mechanical polishing (CMP) to smooth the
wafers flat before each metal deposition step to help with step coverage.

An insulating glass, often sputtered quartz (SiO₂), though other materials are
also used, is deposited between metal layers to help create a smooth surface for
the deposition of the metal. Design rules may refer to these insulators as
intermetal oxide (IMO), whether they are in fact oxides or not, or as interlevel
dielectric (ILD). The IMO may be a spin-on polymer; boron-doped
phosphosilicate glass (BPSG); Si₃N₄; or sandwiches of these materials
(oxynitrides, for example).

We make the connections between m1 and m2 using metal vias, cuts, or just
vias. We cannot connect m2 directly to diffusion or poly; instead we must make
these connections through m1 using a via. Most processes allow contacts and vias
to be placed directly above each other without restriction, arrangements known as
stacked vias and stacked contacts. We call a process with m1 and m2 a two-level
metal (2LM) technology. A 3LM process includes a third-level metal layer (m3
or metal3), and some processes include more metal layers. In this case a
connection between m1 and m2 will use an m1/m2 via, or via1; a connection
between m2 and m3 will use an m2/m3 via, or via2, and so on.

The minimum spacing of interconnects, the metal pitch , may increase with 
successive metal layers. The minimum metal pitch is the minimum spacing 
between the centers of adjacent interconnects and is equal to the minimum metal 
width plus the minimum metal spacing. 

Aluminum interconnect tends to break when carrying a high current density. 
Collisions between high-energy electrons and atoms move the metal atoms over a 
long period of time in a process known as electromigration . Copper is added to 
the aluminum to help reduce the problem. The other solution is to reduce the 
current density by using wider than minimum-width metal lines. 

Tables 2.5 and 2.6 show maximum specified contact resistance and via resistance
for two generations of CMOS processes. Notice that an m1 contact in either
process is equal in resistance to several hundred squares of metal.

TABLE 2.5 Contact resistance (1 μm CMOS).

Contact/via type            Resistance (maximum)
m2/m3 via (via2)            5 Ω
m1/m2 via (via1)            2 Ω
m1/p-diffusion contact      20 Ω
m1/n-diffusion contact      20 Ω
m1/poly contact             20 Ω

TABLE 2.6 Contact resistance (0.35 μm CMOS).

Contact/via type            Resistance (maximum)
m2/m3 via (via2)            6 Ω
m1/m2 via (via1)            6 Ω
m1/p-diffusion contact      20 Ω
m1/n-diffusion contact      20 Ω
m1/poly contact             20 Ω
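
The "several hundred squares" comparison follows from dividing the 20 Ω
maximum contact resistance by the nominal metal sheet resistance from
Tables 2.3 and 2.4. A minimal sketch (the names are illustrative):

```python
CONTACT_OHMS = 20.0  # maximum m1 contact resistance in both processes

# Nominal lower-level metal sheet resistance, ohms/square.
METAL_SHEET_OHMS = {
    "1 um CMOS (m1/2)": 0.070,
    "0.35 um CMOS (m1/2/3)": 0.060,
}

def equivalent_squares(contact_ohms, sheet_ohms_per_square):
    """Number of squares of metal with the same resistance as one contact."""
    return contact_ohms / sheet_ohms_per_square

for process, sheet in METAL_SHEET_OHMS.items():
    squares = equivalent_squares(CONTACT_OHMS, sheet)
    print(f"{process}: one contact ~ {squares:.0f} squares of metal")
```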



1. If only one well layer is drawn, the other mask may be derived from the drawn
layer. For example, p-well (mask) = not (nwell (drawn)). A single-well process
requires only one well mask.

2. The implant masks may be derived or drawn.

3. Largely for historical reasons the contacts to poly and contacts to active have
different layer names. In the past this allowed a different sizing or process bias
to be applied to each contact type when the mask was made.





2.3 CMOS Design Rules 



Figure 2.11 defines the design rules for a CMOS process using pictures. Arrows
between objects denote a minimum spacing, and arrows showing the size of an
object denote a minimum width. Rule 3.1, for example, is the minimum width of
poly (2λ). Each of the rule numbers may have different values for different
manufacturers; there are no standards for design rules. Tables 2.7 to 2.9 show
the MOSIS scalable CMOS rules. Table 2.7 shows the layer rules for the process
front end, which is the front end of the line (as in production line) or FEOL.
Table 2.8 shows the rules for the process back end (BEOL), the metal
interconnect, and Table 2.9 shows the rules for the pad layer and glass layer.

FIGURE 2.11 The MOSIS scalable CMOS design rules (rev. 7). Dimensions are
in λ. Rule numbers are in parentheses (missing rule sets 11 to 13 are extensions
to this basic process).

TABLE 2.7 MOSIS scalable CMOS rules version 7: the process front end.

Layer                  Rule       Explanation                                               Value/λ
well (CWN, CWP)        1.1        minimum width                                             10
                       1.2        minimum space (different potential, a hot well)           9
                       1.3        minimum space (same potential)                            0 or 6
                       1.4        minimum space (different well type)                       0
active (CAA)           2.1/2.2    minimum width/space                                       3
                       2.3        source/drain active to well edge space                    5
                       2.4        substrate/well contact active to well edge space
                       2.5        minimum space between active (different implant type)     0 or 4
poly (CPG)             3.1/3.2    minimum width/space                                       2
                       3.3        minimum gate extension of active                          2
                       3.4        minimum active extension of poly                          3
                       3.5        minimum field poly to active space                        1
select (CSN, CSP)      4.1        minimum select spacing to channel of transistor [1]       3
                       4.2        minimum select overlap of active                          2
                       4.3        minimum select overlap of contact                         1
                       4.4        minimum select width and spacing [2]                      2
poly contact (CCP)     5.1.a      exact contact size                                        2 × 2
                       5.2.a      minimum poly overlap                                      1.5
                       5.3.a      minimum contact spacing                                   2
active contact (CCA)   6.1.a      exact contact size                                        2 × 2
                       6.2.a      minimum active overlap                                    1.5
                       6.3.a      minimum contact spacing                                   2
                       6.4.a      minimum space to gate of transistor                       2




TABLE 2.8 MOSIS scalable CMOS rules version 7: the process back end.

Layer           Rule     Explanation                                        Value/λ
metal1 (CMF)    7.1      minimum width                                      3
                7.2.a    minimum space                                      3
                7.2.b    minimum space (for minimum-width wires only)       2
                7.3      minimum overlap of poly contact                    1
                7.4      minimum overlap of active contact                  1
via1 (CVA)      8.1      exact size                                         2 × 2
                8.2      minimum via spacing                                3
                8.3      minimum overlap by metal1                          1
                8.4      minimum spacing to contact                         2
                8.5      minimum spacing to poly or active edge             2
metal2 (CMS)    9.1      minimum width                                      3
                9.2.a    minimum space                                      4
                9.2.b    minimum space (for minimum-width wires only)       3
                9.3      minimum overlap of via1                            1
via2 (CVS)      14.1     exact size                                         2 × 2
                14.2     minimum space                                      3
                14.3     minimum overlap by metal2                          1
                14.4     minimum spacing to via1                            2
metal3 (CMT)    15.1     minimum width                                      6
                15.2     minimum space                                      4
                15.3     minimum overlap of via2                            2

TABLE 2.9 MOSIS scalable CMOS rules version 7: the pads and overglass
(passivation).

Layer          Rule    Explanation                                                  Value
glass (COG)    10.1    minimum bonding-pad width                                    100 μm × 100 μm
               10.2    minimum probe-pad width                                      75 μm × 75 μm
               10.3    pad overlap of glass opening                                 6 μm
               10.4    minimum pad spacing to unrelated metal2 (or metal3)          30 μm
               10.5    minimum pad spacing to unrelated metal1, poly, or active     15 μm



The rules in Table 2.7 and Table 2.8 are given as multiples of λ. If we use
lambda-based rules we can move between successive process generations just by
changing the value of λ. For example, we can scale 0.5 μm layouts (λ = 0.25 μm)
by a factor of 0.175 / 0.25 for a 0.35 μm process (λ = 0.175 μm), at least in
theory. You may get an inkling of the practical problems from the fact that the
values for pad dimensions and spacing in Table 2.9 are given in microns and not
in λ. This is because bonding to the pads is an operation that does not scale
well. Often companies have two sets of design rules: one in λ (with fractional λ
rules) and the other in microns. Ideally we would like to express all of the
design rules in integer multiples of λ. This was true for revisions 4 to 6, but
not revision 7 of the MOSIS rules. In revision 7 rules 5.2.a/6.2.a are
noninteger. The original Mead-Conway NMOS rules include a noninteger 1.5 λ rule
for the implant layer.
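
Scaling lambda-based rules is a single multiplication per rule. The sketch
below converts a handful of rules from Tables 2.7 and 2.8 to microns for the two
values of λ mentioned above, and forms the metal1 pitch as minimum width plus
minimum spacing. The dictionary keys and function names are hypothetical.

```python
# A few rules from Tables 2.7 and 2.8, in units of lambda.
RULES_IN_LAMBDA = {
    "poly width (3.1)": 2,
    "poly overlap of contact (5.2.a)": 1.5,
    "metal1 width (7.1)": 3,
    "metal1 space (7.2.a)": 3,
}

def to_microns(value_in_lambda, lambda_um):
    """Convert a lambda-based dimension to microns."""
    return value_in_lambda * lambda_um

def metal1_pitch(lambda_um):
    """Minimum metal1 pitch: minimum width plus minimum spacing."""
    width = RULES_IN_LAMBDA["metal1 width (7.1)"]
    space = RULES_IN_LAMBDA["metal1 space (7.2.a)"]
    return to_microns(width + space, lambda_um)

for lam in (0.25, 0.175):
    print(f"lambda = {lam} um")
    for rule, value in RULES_IN_LAMBDA.items():
        print(f"  {rule}: {to_microns(value, lam):.4f} um")
    print(f"  metal1 pitch: {metal1_pitch(lam):.3f} um")

print("scale factor:", 0.175 / 0.25)
```

The pad rules of Table 2.9 are deliberately left out, since they are specified
in microns and do not scale with λ.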

1. To ensure source and drain width. 

2. Different select types may touch but not overlap. 





2.4 Combinational Logic Cells 

The AND-OR-INVERT (AOI) and the OR-AND-INVERT (OAI) logic cells are
particularly efficient in CMOS. Figure 2.12 shows an AOI221 and an OAI321
logic cell (the logic symbols in Figure 2.12 are not standards, but are widely
used). All indices (the indices are the numbers after AOI or OAI) in the logic
cell name greater than 1 correspond to the inputs to the first level or stage:
the AND gate(s) in an AOI cell, for example. An index of '1' corresponds to a
direct input to the second-stage cell. We write indices in descending order; so
it is AOI221 and not AOI122 (but both are equivalent cells), and AOI32 not AOI23.
If we have more than one direct input to the second stage we repeat the '1'; thus
an AOI211 cell performs the function Z = (A · B + C + D)'. A three-input NAND
cell is an OAI111, but calling it that would be very confusing. These rules are
not standard, but form a convention that we shall adopt and one that is widely
used in the ASIC industry.

There are many ways to represent the logical operator, AND. I shall use the
middle dot and write A · B (rather than AB, A.B, or A ∧ B); occasionally I may
use AND(A, B). Similarly I shall write A + B as well as OR(A, B). I shall use an
apostrophe like this, A', to denote the complement of A rather than
$\overline{A}$, since sometimes it is difficult or inappropriate to use an
overbar (vinculum) or diacritical mark (macron). It is possible to misinterpret
AB' as $A\overline{B}$ rather than $\overline{AB}$ (but the former alternative
would be A · B' in my convention). I shall be careful in these situations.

FIGURE 2.12 Naming and numbering complex CMOS combinational cells. (a) An
AND-OR-INVERT cell, an AOI221. (b) An OR-AND-INVERT cell, an OAI321.
Numbering is always in descending order.



We can express the function of the AOI221 cell in Figure 2.12(a) as

Z = (A · B + C · D + E)' . (2.25)

We can also write this equation unambiguously as Z = AOI221(A, B, C, D, E),
just as we might write X = NAND(I, J, K) to describe the logic function
X = (I · J · K)'.




This notation is useful because, for example, if we write OAI321(P, Q, R, S, T,
U) we immediately know that U (the sixth input) is the (only) direct input
connected to the second stage. Sometimes we need to refer to particular inputs
without listing them all. We can adopt another convention that letters of the
input names change with the index position. Now we can refer to input B2 of an
AOI321 cell, for example, and know which input we are talking about without
writing

Z = AOI321(A1, A2, A3, B1, B2, C) . (2.26)
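
The naming convention can be turned into a small evaluator: the indices say how
the ordered input list is split into first-stage groups. The sketch below is
one hypothetical way to write it; the helper names are not from the text.

```python
from itertools import product

def _split(indices, inputs):
    """Split the ordered inputs into groups whose sizes are the cell indices."""
    if sum(indices) != len(inputs):
        raise ValueError("number of inputs must equal the sum of the indices")
    groups, pos = [], 0
    for n in indices:
        groups.append(inputs[pos:pos + n])
        pos += n
    return groups

def aoi(indices, *inputs):
    """AND-OR-INVERT: AND within each group, OR the groups, then invert."""
    return not any(all(group) for group in _split(indices, inputs))

def oai(indices, *inputs):
    """OR-AND-INVERT: OR within each group, AND the groups, then invert."""
    return not all(any(group) for group in _split(indices, inputs))

# Check AOI221 against Eq. 2.25 for every input combination.
for a, b, c, d, e in product((False, True), repeat=5):
    assert aoi((2, 2, 1), a, b, c, d, e) == (not ((a and b) or (c and d) or e))

print(aoi((2, 2, 1), True, True, False, False, False))   # prints False
print(oai((1, 1, 1), True, False, True))                  # NAND3, prints True
```

An index of 1 simply produces a one-input group, which is why a direct input to
the second stage needs no special handling.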

Table 2.10 shows the AOI family of logic cells with three indices (with branches
in the family for AOI, OAI, AO, and OA cells). There are 5 types and 14 separate
members of each branch of this family. There are thus 4 × 14 = 56 cells of the
type Xabc where X = {OAI, AOI, OA, AO} and each of the indexes a, b, and c
can range from 1 to 3. We form the AND-OR (AO) and OR-AND (OA) cells by
adding an inverter to the output of an AOI or OAI cell.
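The count of 14 members per branch can be checked by enumeration. The short Python sketch below is only an illustration: it assumes (reading from the cell list in Table 2.10) that a cell has two or three indices, that each first-stage gate has 2 or 3 inputs, that direct inputs are written as a trailing 1, and that indices are listed in descending order. The name family_members is hypothetical.

```python
from itertools import combinations_with_replacement

def family_members(max_indices=3, widths=(3, 2)):
    """Return index strings such as '221', written in descending order."""
    members = []
    for n_wide in range(1, max_indices + 1):                 # first-stage gates with 2 or 3 inputs
        for n_direct in range(0, max_indices - n_wide + 1):  # direct inputs (index 1)
            if n_wide + n_direct < 2:
                continue  # a single index would be a plain NAND or NOR cell
            for wide in combinations_with_replacement(widths, n_wide):
                members.append("".join(map(str, wide)) + "1" * n_direct)
    return members

members = family_members()
branches = ["AOI", "OAI", "AO", "OA"]
all_cells = [branch + m for branch in branches for m in members]
print(len(members), len(all_cells))
```

Under these assumptions the enumeration reproduces the 14 members per branch and the 4 × 14 = 56 cells quoted in the text.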

TABLE 2.10 The AOI family of cells with three index numbers or less.

Cell type [1]   Cells                     Number of unique cells
Xa1             X21, X31                  2
Xa11            X211, X311                2
Xab             X22, X33, X32             3
Xab1            X221, X331, X321          3
Xabc            X222, X333, X332, X322    4
Total                                     14

The AOI and OAI logic cells can be built using a single stage in CMOS using
series-parallel networks of transistors called stacks. Figure 2.13 illustrates the
procedure to build the n-channel and p-channel stacks, using the AOI221 cell as
an example.

2.4.1 Pushing Bubbles

FIGURE 2.13 Constructing a CMOS logic cell, an AOI221. (a) First build the
dual icon by using de Morgan's theorem to push inversion bubbles to the
inputs. (b) Next build the n-channel and p-channel stacks from series and
parallel combinations of transistors. (c) Adjust transistor sizes so that the
n-channel and p-channel stacks have equal strengths.



Here are the steps to construct any single-stage combinational CMOS logic cell: 

1. Draw a schematic icon with an inversion (bubble) on the last cell (the
bubble-out schematic). Use de Morgan's theorems (a NAND is an OR
with inverted inputs and a NOR is an AND with inverted inputs) to push
the output bubble back to the inputs (this is the dual icon or bubble-in
schematic).

2. Form the n-channel stack working from the inputs on the bubble-out
schematic: OR translates to a parallel connection, AND translates to a
series connection. If you have a bubble at an input, you need an inverter.

3. Form the p-channel stack using the bubble-in schematic (ignore the
inversions at the inputs; the bubbles on the gate terminals of the p-channel
transistors take care of these). If you do not have a bubble at the input gate
terminals, you need an inverter (these will be the same input gate terminals
that had bubbles in the bubble-out schematic).

The two stacks are network duals (they can be derived from each other by
swapping series connections for parallel, and parallel for series connections). The
n-channel stack implements the strong '0's of the function and the p-channel
stack provides the strong '1's. The final step is to adjust the drive strength of the
logic cell by sizing the transistors.
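The duality of the two stacks can be illustrated with a small switch-level sketch in Python. This is only a model under simplifying assumptions: each transistor is an ideal switch, a network is a nested tuple tagged "S" (series) or "P" (parallel), and the names dual, conducts, and aoi221 are hypothetical. The n-channel stack is written directly from Z = (A • B + C • D + E)', and the p-channel stack is obtained by swapping series and parallel connections.

```python
from itertools import product

def dual(net):
    """Swap series ('S') and parallel ('P') connections throughout a network."""
    if isinstance(net, str):          # a single transistor, named by its gate input
        return net
    kind, *parts = net
    return ("P" if kind == "S" else "S", *map(dual, parts))

def conducts(net, inputs, on_level):
    """True if the network conducts; a transistor is on when its input equals on_level."""
    if isinstance(net, str):
        return inputs[net] == on_level
    kind, *parts = net
    results = [conducts(p, inputs, on_level) for p in parts]
    return all(results) if kind == "S" else any(results)

# n-channel stack: OR becomes parallel, AND becomes series.
N_STACK = ("P", ("S", "A", "B"), ("S", "C", "D"), "E")
P_STACK = dual(N_STACK)

def aoi221(inputs):
    pull_down = conducts(N_STACK, inputs, on_level=1)  # n-channel devices turn on with '1'
    pull_up = conducts(P_STACK, inputs, on_level=0)    # p-channel devices turn on with '0'
    assert pull_down != pull_up, "exactly one stack should conduct"
    return 0 if pull_down else 1

for values in product((0, 1), repeat=5):
    x = dict(zip("ABCDE", values))
    expected = 1 - ((x["A"] & x["B"]) | (x["C"] & x["D"]) | x["E"])
    assert aoi221(x) == expected
```

For every input combination exactly one of the two stacks conducts, so the output is always driven to a strong '0' or a strong '1'.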

2.4.2 Drive Strength 

Normally we ratio the sizes of the n-channel and p-channel transistors in an
inverter so that both types of transistors have the same resistance, or drive
strength. That is, we make β_n = β_p. At low dopant concentrations and low
electric fields μ_n is about twice μ_p. To compensate we make the shape factor,
W/L, of the p-channel transistor in an inverter about twice that of the n-channel
transistor (we say the logic has a ratio of 2). Since the transistor lengths are
normally equal to the minimum poly width for both types of transistors, the ratio
of the transistor widths is also equal to 2. With the high dopant concentrations
and high electric fields in submicron transistors the difference in mobilities is
less; the ratio is typically between 1 and 1.5.

Logic cells in a library have a range of drive strengths. We normally call the
minimum-size inverter a 1X inverter. The drive strength of a logic cell is often
used as a suffix; thus a 1X inverter has a cell name such as INVX1 or INVD1. An
inverter with transistors that are twice the size will be an INVX2. Drive strengths
are normally scaled in a geometric ratio, so we have 1X, 2X, 4X, and
(sometimes) 8X or even higher, drive-strength cells. We can size a logic cell
using these basic rules:

• Any string of transistors connected between a power supply and the output
in a cell with 1X drive should have the same resistance as the n-channel
transistor in a 1X inverter.

• A transistor with shape factor W1/L1 has a resistance proportional to
L1/W1 (so the larger W1 is, the smaller the resistance).

• Two transistors in parallel with shape factors W1/L1 and W2/L2 are
equivalent to a single transistor (W1/L1 + W2/L2)/1. For example, a 2/1
in parallel with a 3/1 is a 5/1.

• Two transistors, with shape factors W1/L1 and W2/L2, in series are
equivalent to a single 1/(L1/W1 + L2/W2) transistor.

For example, a transistor with shape factor 3/1 (we shall call this a 3/1) in series
with another 3/1 is equivalent to a 1/((1/3) + (1/3)) or a 3/2. We can use the
following method to calculate equivalent transistor sizes:

• To add transistors in parallel, make all the lengths 1 and add the widths. 

• To add transistors in series, make all the widths 1 and add the lengths. 

We have to be careful to keep W and L reasonable. For example, a 3/1 in series
with a 2/1 is equivalent to a 1/((1/3) + (1/2)) or 1/0.83. Since we cannot make a
device 2λ wide and 1.66λ long, a 1/0.83 is more naturally written as 3/2.5. We
like to keep both W and L as integer multiples of 0.5 (equivalent to making W
and L integer multiples of λ), but W and L must be greater than 1.
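The series and parallel rules are easy to check numerically. The following Python sketch treats a shape factor W/L as an exact fraction; the function names parallel and series are hypothetical, and the sketch ignores the practical rounding of W and L to multiples of 0.5 discussed above.

```python
from fractions import Fraction

def parallel(*shapes):
    """Equivalent shape factor of transistors in parallel: the W/L values add."""
    return sum(Fraction(s) for s in shapes)

def series(*shapes):
    """Equivalent shape factor of transistors in series: the L/W values add."""
    return Fraction(1) / sum(Fraction(1) / Fraction(s) for s in shapes)

print(parallel(2, 3))      # a 2/1 in parallel with a 3/1
print(series(3, 3))        # a 3/1 in series with a 3/1
print(float(series(3, 2))) # a 3/1 in series with a 2/1, as a decimal W/L
```

The last case gives W/L = 6/5, which is the same device as the 1/0.83 in the text (and close to the 3/2.5 that we would actually draw).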

In Figure 2.13(c) the transistors in the AOI221 cell are sized so that any string
through the p-channel stack has a drive strength equivalent to a 2/1 p-channel
transistor (we choose the worst case; if more than one transistor in parallel is
conducting then the drive strength will be higher). The n-channel stack is sized
so that it has a drive strength of a 1/1 n-channel transistor. The ratio in this
library is thus 2.

If we were to use four drive strengths for each of the AOI family of cells shown
in Table 2.10, we would have a total of 224 combinational library cells, just for
the AOI family. The synthesis tools can handle this number of cells, but we may
not be able to design this many cells in a reasonable amount of time. Section 3.3,
Logical Effort, will help us choose the most logically efficient cells.

2.4.3 Transmission Gates 



Figure 2.14(a) and (b) shows a CMOS transmission gate (TG, TX gate, pass
gate, coupler). We connect a p-channel transistor (to transmit a strong '1') in
parallel with an n-channel transistor (to transmit a strong '0').

FIGURE 2.14 CMOS transmission gate (TG). (a) An n-channel and p-channel
transistor in parallel form a TG. (b) A common symbol for a TG. (c) The
charge-sharing problem.



We can express the function of a TG as

Z = TG(A, S) , (2.27)

but this is ambiguous: if we write TG(X, Y), how do we know if X is connected to
the gates or sources/drains of the TG? We shall always define TG(X, Y) when we
use it. It is tempting to write TG(A, S) = A • S, but what is the value of Z when
S = '0' in Figure 2.14(a), since Z is then left floating? A TG is a switch, not an AND
logic cell.

There is a potential problem if we use a TG as a switch connecting a node Z that
has a large capacitance, C_BIG, to an input node A that has only a small
capacitance, C_SMALL (see Figure 2.14c). If the initial voltage at A is V_SMALL
and the initial voltage at Z is V_BIG, when we close the TG (by setting S = '1') the
final voltage on both nodes A and Z is

V_F = (C_BIG V_BIG + C_SMALL V_SMALL) / (C_BIG + C_SMALL) . (2.28)



Imagine we want to drive a '1' onto node Z from node A. Suppose C_BIG = 0.2 pF
(about 10 standard loads in a 0.5 μm process) and C_SMALL = 0.02 pF, V_BIG = 0
V and V_SMALL = 5 V; then

V_F = [(0.2 × 10^-12)(0) + (0.02 × 10^-12)(5)] / [(0.2 × 10^-12) + (0.02 × 10^-12)] = 0.45 V . (2.29)

This is not what we want at all; the big capacitor has forced node A to a voltage
close to a '0'. This type of problem is known as charge sharing. We should make
sure that either (1) node A is strong enough to overcome the big capacitor, or (2)
node A is insulated from node Z by a buffer (an inverter, for example)
between node A and node Z. We must not use charge to drive another logic cell;
only a logic cell can drive a logic cell.
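The charge-sharing calculation of Eq. 2.28 is simple enough to script. The Python sketch below just evaluates the charge-conservation formula for two ideal capacitors (no leakage, and no driver on either node); the function name shared_voltage is hypothetical, and the values are those used in Eq. 2.29.

```python
def shared_voltage(c_big, v_big, c_small, v_small):
    """Final voltage after two charged capacitors are connected (Eq. 2.28)."""
    return (c_big * v_big + c_small * v_small) / (c_big + c_small)

# Values from the example: 0.2 pF at 0 V shares charge with 0.02 pF at 5 V.
v_f = shared_voltage(c_big=0.2e-12, v_big=0.0, c_small=0.02e-12, v_small=5.0)
print(f"V_F = {v_f:.2f} V")
```

Because the ratio of the capacitances is 10:1, the final voltage lies much closer to the initial voltage of the big capacitor than to that of the small one.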

If we omit one of the transistors in a TG (usually the p -channel transistor) we 
have a pass transistor . There is a branch of full-custom VLSI design that uses 
pass-transistor logic. Much of this is based on relay-based logic, since a single 
transistor switch looks like a relay contact. There are many problems associated 
with pass-transistor logic related to charge sharing, reduced noise margins, and 
the difficulty of predicting delays. Though pass transistors may appear in an 
ASIC cell inside a library, they are not used by ASIC designers.

FIGURE 2.15 The CMOS multiplexer (MUX). (a) A noninverting 2:1 MUX
using transmission gates without buffering. (b) A symbol for a MUX (note how
the inputs are labeled). (c) An IEEE standard symbol for a MUX. (d) A
nonstandard, but very common, IEEE symbol for a MUX. (e) An inverting
MUX with output buffer. (f) A noninverting buffered MUX.

We can use two TGs to form a multiplexer (or multiplexor; people use both
spellings) as shown in Figure 2.15(a). We often shorten multiplexer to MUX.
The MUX function for two data inputs, A and B, with a select signal S, is

Z = TG(A, S') + TG(B, S) . (2.30) 

We can write this as Z = A • S' + B • S, since node Z is always connected to one
or other of the inputs (and we assume both are driven). This is a two-input MUX
(2-to-1 MUX or 2:1 MUX). Unfortunately, we can also write the MUX function
as Z = A • S + B • S', so it is difficult to write the MUX function unambiguously
as Z = MUX(X, Y, Z). For example, is the select input X, Y, or Z? We shall
define the function MUX(X, Y, Z) each time we use it. We must also be careful
to label a MUX if we use the symbol shown in Figure 2.15(b). Symbols for a
MUX are shown in Figure 2.15(b-d). In the IEEE notation 'G' specifies an AND
dependency. Thus, in Figure 2.15(c), G = '1' selects the input labeled '1'.

Figure 2.15(d) uses the common control block symbol (the notched rectangle).
Here, G1 = '1' selects the input '1', and G1 = '0' selects the input labeled '1'
with an overbar. Strictly this
form of IEEE symbol should be used only for elements with more than one
section controlled by common signals, but the symbol of Figure 2.15(d) is used
often for a 2:1 MUX.

The MUX shown in Figure 2.15(a) works, but there is a potential charge-sharing
problem if we cascade MUXes (connect them in series). Instead most ASIC
libraries use MUX cells built with a more conservative approach. We could
buffer the output using an inverter (Figure 2.15e), but then the MUX becomes
inverting. To build a safe, noninverting MUX we can buffer the inputs and output
(Figure 2.15f), requiring 12 transistors, or 3 gate equivalents (only the gate
equivalent counts are shown from now on).

Figure 2.16 shows how to use an OAI22 logic cell (and an inverter) to implement
an inverting MUX. The implementation in equation form (2.5 gates) is

ZN = A' • S' + B' • S
   = [(A' • S')' • (B' • S)']'
   = [(A + S) • (B + S')]'
   = OAI22[A, S, B, NOT(S)] . (2.31)

(both A' and NOT(A) represent an inverter, depending on which representation is
most convenient; they are equivalent). I often use an equation to describe a cell
implementation.
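An identity such as Eq. 2.31 can be confirmed by checking every input combination. The Python sketch below does this for the inverting MUX; it models each cell as a pure Boolean function (no delays or drive strengths), and the function names oai22, mux, and inverting_mux are hypothetical. The MUX convention is that of Eq. 2.30, Z = A • S' + B • S.

```python
from itertools import product

def oai22(a1, a2, b1, b2):
    """OR-AND-INVERT: [(a1 + a2) . (b1 + b2)]'"""
    return 1 - ((a1 | a2) & (b1 | b2))

def mux(a, b, s):
    """Noninverting 2:1 MUX of Eq. 2.30: Z = A.S' + B.S"""
    return (a & (1 - s)) | (b & s)

def inverting_mux(a, b, s):
    """Eq. 2.31: ZN = OAI22[A, S, B, NOT(S)]"""
    return oai22(a, s, b, 1 - s)

for a, b, s in product((0, 1), repeat=3):
    assert inverting_mux(a, b, s) == 1 - mux(a, b, s)
    # The first line of Eq. 2.31, ZN = A'.S' + B'.S, gives the same result.
    assert inverting_mux(a, b, s) == ((1 - a) & (1 - s)) | ((1 - b) & s)
```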



FIGURE 2.16 An inverting 2:1 MUX based on an
OAI22 cell.




The following factors will determine which MUX implementation is best: 

1. Do we want to minimize the delay between the select input and the output
or between the data inputs and the output?

2. Do we want an inverting or noninverting MUX?

3. Do we object to having any logic cell inputs tied directly to the
source/drain diffusions of a transmission gate? (Some companies forbid
such transmission-gate inputs since some simulation tools cannot handle
them.)

4. Do we object to any logic cell outputs being tied to the source/drain of a
transmission gate? (Some companies will not allow this because of the
dangers of charge sharing.)

5. What drive strength do we require (and is size or speed more important)? 

A minimum-size TG is a little slower than a minimum-size inverter, so there is 
not much difference between the implementations shown in Figure 2.15 and 
Figure 2.16, but the difference can become important for 4:1 and larger MUXes. 

2.4.4 Exclusive-OR Cell 

The two-input exclusive-OR (XOR, EXOR, not-equivalence, ring-OR) function
is

A1 ⊕ A2 = XOR(A1, A2) = A1 • A2' + A1' • A2 . (2.32)

We are now using multiletter symbols, but there should be no doubt that A1'
means NOT(A1). We can implement a two-input XOR using
a MUX and an inverter as follows (2 gates):

XOR(A1, A2) = MUX[NOT(A1), A1, A2] , (2.33)

where

MUX(A, B, S) = A • S + B • S' . (2.34)

This implementation only buffers one input and does not buffer the MUX output.
We can use inverter buffers (3.5 gates total) or an inverting MUX so that the
XOR cell does not have any external connections to source/drain diffusions as
follows (3 gates total):

XOR(A1, A2) = NOT[MUX(NOT[NOT(A1)], NOT(A1), A2)] . (2.35)

We can also implement a two-input XOR using an AOI21 (and a NOR cell),
since

XOR(A1, A2) = A1 • A2' + A1' • A2
            = [(A1 • A2) + (A1 + A2)']'
            = AOI21[A1, A2, NOR(A1, A2)] , (2.36)

(2.5 gates). Similarly we can implement an exclusive-NOR (XNOR, equivalence)
logic cell using an inverting MUX (and two inverters, total 3.5 gates) or an
OAI21 logic cell (and a NAND cell, total 2.5 gates) as follows (using the MUX
function of Eq. 2.34):

XNOR(A1, A2) = A1 • A2 + NOT(A1) • NOT(A2)
             = NOT[NOT[MUX(A1, NOT(A1), A2)]]
             = OAI21[A1, A2, NAND(A1, A2)] . (2.37)
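The XOR and XNOR implementations of Eqs. 2.33 to 2.37 can all be checked against the truth table in a few lines. The Python sketch below is a purely logical model (the helper names are hypothetical, and it says nothing about gate counts or delay); note that it uses the MUX convention of Eq. 2.34, MUX(A, B, S) = A • S + B • S', which differs from that of Eq. 2.30.

```python
from itertools import product

def inv(a):            return 1 - a
def nand(a, b):        return 1 - (a & b)
def nor(a, b):         return 1 - (a | b)
def aoi21(a1, a2, b):  return 1 - ((a1 & a2) | b)   # [(a1 . a2) + b]'
def oai21(a1, a2, b):  return 1 - ((a1 | a2) & b)   # [(a1 + a2) . b]'
def mux(a, b, s):      return (a & s) | (b & inv(s))  # Eq. 2.34

xor_forms = {
    "Eq. 2.33": lambda a1, a2: mux(inv(a1), a1, a2),
    "Eq. 2.35": lambda a1, a2: inv(mux(inv(inv(a1)), inv(a1), a2)),
    "Eq. 2.36": lambda a1, a2: aoi21(a1, a2, nor(a1, a2)),
}
xnor_forms = {
    "Eq. 2.37 (MUX)":   lambda a1, a2: inv(inv(mux(a1, inv(a1), a2))),
    "Eq. 2.37 (OAI21)": lambda a1, a2: oai21(a1, a2, nand(a1, a2)),
}

for a1, a2 in product((0, 1), repeat=2):
    for name, f in xor_forms.items():
        assert f(a1, a2) == a1 ^ a2, name
    for name, f in xnor_forms.items():
        assert f(a1, a2) == 1 - (a1 ^ a2), name
```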



1. Xabc: X = {AOI, AO, OAI, OA}; a, b, c = {2, 3}; { } means choose one. 





2.5 Sequential Logic Cells 

There are two main approaches to clocking in VLSI design: multiphase clocks or 
a single clock and synchronous design . The second approach has the following 
key advantages: (1) it allows automated design, (2) it is safe, and (3) it permits 
vendor signoff (a guarantee that the ASIC will work as simulated). These 
advantages of synchronous design (especially the last one) usually outweigh 
every other consideration in the choice of a clocking scheme. The vast majority 
of ASICs use a rigid synchronous design style. 

2.5.1 Latch 

Figure 2.17(a) shows a sequential logic cell, a latch. The internal clock signals,
CLKN (N for negative) and CLKP (P for positive), are generated from the system
clock, CLK, by two inverters (I4 and I5) that are part of every latch cell; it is
usually too dangerous to have these signals supplied externally, even though it
would save space.

FIGURE 2.17 CMOS latch. (a) A positive-enable latch using transmission gates
without output buffering; the enable (clock) signal is buffered inside the latch.
(b) A positive-enable latch is transparent while the enable is high. (c) The latch
stores the last value at D when the enable goes low.

To emphasize the difference between a latch and flip-flop, sometimes people
refer to the clock input of a latch as an enable. This makes sense when we look
at Figure 2.17(b), which shows the operation of a latch. When the clock input is
high, the latch is transparent: changes at the D input appear at the output Q (quite
different from a flip-flop as we shall see). When the enable (clock) goes low
(Figure 2.17c), inverters I2 and I3 are connected together, forming a storage loop
that holds the last value on D until the enable goes high again. The storage loop
will hold its state as long as power is on; we call this a static latch. A sequential
logic cell is different from a combinational cell because it has this feature of
storage or memory.

Notice that the output Q is unbuffered and connected directly to the output of I2
(and the input of I3), which is a storage node. In an ASIC library we are
conservative and add an inverter to buffer the output, isolate the sensitive storage
node, and thus invert the sense of Q. If we want both Q and QN we have to add
two inverters to the circuit of Figure 2.17(a). This means that a latch requires
seven inverters and two TGs (4.5 gates).

The latch of Figure 2.17(a) is a positive-enable D latch, active-high D latch, or 
transparent-high D latch (sometimes people also call this a D-type latch). A 
negative-enable (active-low) D latch can be built by inverting all the clock 
polarities in Figure 2.17(a) (swap CLKN for CLKP and vice-versa). 

2.5.2 Flip-Flop 

Figure 2.18(a) shows a flip-flop constructed from two D latches: a master latch
(the first one) and a slave latch. This flip-flop contains a total of nine inverters
and four TGs, or 6.5 gates. In this flip-flop design the storage node S is buffered
and the clock-to-Q delay will be one inverter delay less than the clock-to-QN
delay.

FIGURE 2.18 CMOS flip-flop. (a) This negative-edge-triggered flip-flop
consists of two latches: master and slave. (b) While the clock is high, the master
latch is loaded. (c) As the clock goes low, the slave latch loads the value of the
master latch. (d) Waveforms illustrating the definition of the flip-flop setup time
t_SU, hold time t_H, and propagation delay from clock to Q, t_PD.

In Figure 2.18(b) the clock input is high, the master latch is transparent, and node 
M (for master) will follow the D input. Meanwhile the slave latch is disconnected 
from the master latch and is storing whatever the previous value of Q was. As the 
clock goes low (the negative edge) the slave latch is enabled and will update its 
state (and the output Q) to the value of node M at the negative edge of the clock. 
The slave latch will then keep this value of M at the output Q, despite any 
changes at the D input while the clock is low (Figure 2.18c). When the clock 
goes high again, the slave latch will store the captured value of M (and we are 
back where we started our explanation). 
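The master and slave behavior just described can be mimicked with a zero-delay behavioral model. The Python sketch below is only an abstraction: it ignores setup and hold times, propagation delay, and the internal inverters and TGs, and the class names DLatch and NegEdgeFlipFlop are hypothetical. The master latch is transparent while the clock is high and the slave latch is transparent while the clock is low.

```python
class DLatch:
    """Level-sensitive D latch: transparent while clk equals transparent_when."""
    def __init__(self, transparent_when):
        self.transparent_when = transparent_when
        self.q = 0  # arbitrary initial stored value

    def update(self, d, clk):
        if clk == self.transparent_when:
            self.q = d      # transparent: output follows D
        return self.q       # otherwise the storage loop holds the last value


class NegEdgeFlipFlop:
    """Master (transparent-high) followed by slave (transparent-low)."""
    def __init__(self):
        self.master = DLatch(transparent_when=1)
        self.slave = DLatch(transparent_when=0)

    def update(self, d, clk):
        m = self.master.update(d, clk)   # node M
        return self.slave.update(m, clk) # output Q


def simulate(ff, stimulus):
    """Apply a list of (clk, d) pairs and return the list of Q values."""
    return [ff.update(d, clk) for clk, d in stimulus]


# D is high while the clock is high, then changes while the clock is low.
stimulus = [(1, 1), (0, 1), (0, 0), (1, 0), (0, 0)]
print(simulate(NegEdgeFlipFlop(), stimulus))
```

In this model Q changes only on the steps where the clock has just gone low, and a change at D while the clock is low (the third step) has no effect on Q.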

The combination of the master and slave latches acts to capture or sample the D
input at the negative clock edge, the active clock edge. This type of flip-flop is a
negative-edge-triggered flip-flop and its behavior is quite different from a latch.
The behavior is shown on the IEEE symbol by using a triangular notch to
denote an edge-sensitive input. A bubble shows the input is sensitive to the
negative edge. To build a positive-edge-triggered flip-flop we invert the polarity
of all the clocks, as we did for a latch.

The waveforms in Figure 2.18(d) show the operation of the flip-flop as we have
described it, and illustrate the definition of setup time (t_SU), hold time (t_H),
and clock-to-Q propagation delay (t_PD). We must keep the data stable (a fixed
logic '1' or '0') for a time t_SU prior to the active clock edge, and stable for a time
t_H after the active clock edge (during the decision window shown).

In Figure 2.18(d) times are measured from the points at which the waveforms
cross 50 percent of V_DD. We say the trip point is 50 percent or 0.5. Common
choices are 0.5 or 0.65/0.35 (a signal has to reach 0.65 V_DD to be a '1', and reach
0.35 V_DD to be a '0'), or 0.1/0.9 (there is no standard way to write a trip point).
Some vendors use different trip points for the input and output waveforms
(especially in I/O cells).

The flip-flop in Figure 2.18(a) is a D flip-flop and is by far the most widely used
type of flip-flop in ASIC design. There are other types of flip-flops (J-K, T
(toggle), and S-R flip-flops) that are provided in some ASIC cell libraries mainly
for compatibility with TTL design. Some people use the term register to mean an
array (more than one) of flip-flops or latches (on a data bus, for example), but
some people use register to mean a single flip-flop or a latch. This is confusing
since flip-flops and latches are quite different in their behavior. When I am
talking about logic cells, I use the term register to mean more than one flip-flop.

To add an asynchronous set (Q to '1') or asynchronous reset (Q to '0') to the
flip-flop of Figure 2.18(a), we replace one inverter in both the master and slave
latches with two-input NAND cells. Thus, for an active-low set, we replace I2
and I7 with two-input NAND cells, and, for an active-low reset, we replace I3
and I6. For both set and reset we replace all four inverters: I2, I3, I6, and I7.

Some TTL flip-flops have dominant reset or dominant set, but this is difficult
(and dangerous) to do in ASIC design. An input that forces Q to '1' is sometimes
also called preset. The IEEE logic symbols use 'P' to denote an input with a
presetting action. An input that forces Q to '0' is often also called clear. The
IEEE symbols use 'R' to denote an input with a resetting action.

2.5.3 Clocked Inverter 

Figure 2.19 shows how we can derive the structure of a clocked inverter from the
series combination of an inverter and a TG. The arrows in Figure 2.19(b)
represent the flow of current when the inverter is charging (I_R) or discharging
(I_F) a load capacitance through the TG. We can break the connection between the
inverter cells and use the circuit of Figure 2.19(c) without substantially affecting
the operation of the circuit. The symbol for the clocked inverter shown in
Figure 2.19(d) is common, but by no means a standard.




FIGURE 2.19 Clocked inverter. (a) An inverter plus transmission gate (TG).
(b) The current flow in the inverter and TG allows us to break the connection
between the transistors in the inverter. (c) Breaking the connection forms a
clocked inverter. (d) A common symbol.

We can use the clocked inverter to replace the inverter-TG pairs in latches and
flip-flops. For example, we can replace one or both of the inverters I1 and I3
(together with the TGs that follow them) in Figure 2.17(a) by clocked inverters.
There is not much to choose between the different implementations in this case,
except that layout may be easier for the clocked inverter versions (since there is
one less connection to make).

More interesting is the flip-flop design: We can only replace inverters I1, I3, and
I7 (and the TGs that follow them) in Figure 2.18(a) by clocked inverters. We
cannot replace inverter I6 because it is not directly connected to a TG. We can
replace the TG attached to node M with a clocked inverter, and this will invert
the sense of the output Q, which thus becomes QN. Now the clock-to-Q delay
will be slower than clock-to-QN, since Q (which was QN) now comes one
inverter later than QN.

If we wish to build a flip-flop with a fast clock-to-QN delay it may be better to 
build it using clocked inverters and use inverters with TGs for a flip-flop with a 
fast clock-to-Q delay. In fact, since we do not always use both Q and QN outputs 
of a flip-flop, some libraries include Q only or QN only flip-flops that are slightly 
smaller than those with both polarity outputs. It is slightly easier to lay out
clocked inverters than an inverter plus a TG, so flip-flops in commercial libraries
include a mixture of clocked-inverter and TG implementations.






2.6 Datapath Logic Cells 

Suppose we wish to build an n-bit adder (that adds two n-bit numbers) and to exploit
the regularity of this function in the layout. We can do so using a datapath structure.

The following two functions, SUM and COUT, implement the sum and carry out for a
full adder (FA) with two data inputs (A, B) and a carry in, CIN:

SUM = A ⊕ B ⊕ CIN = SUM(A, B, CIN) = PARITY(A, B, CIN) , (2.38)

COUT = A • B + A • CIN + B • CIN = MAJ(A, B, CIN) . (2.39)

The sum uses the parity function ('1' if there is an odd number of '1's in the inputs).
The carry out, COUT, uses the 2-of-3 majority function ('1' if the majority of the inputs
are '1'). We can combine these two functions in a single FA logic cell,
ADD(A[i], B[i], CIN, S[i], COUT), shown in Figure 2.20(a), where

S[i] = SUM(A[i], B[i], CIN) , (2.40)

COUT = MAJ(A[i], B[i], CIN) . (2.41)

Now we can build a 4-bit ripple-carry adder (RCA) by connecting four of these ADD
cells together as shown in Figure 2.20(b). The i th ADD cell is arranged with the
following: two bus inputs A[i], B[i]; one bus output S[i]; an input, CIN, that is the
carry in from stage (i − 1) below and is also passed up to the cell above as an output;
and an output, COUT, that is the carry out to stage (i + 1) above. In the 4-bit adder
shown in Figure 2.20(b) we connect the carry input, CIN[0], to VSS and use COUT[3]
and COUT[2] to indicate arithmetic overflow (in Section 2.6.1 we shall see why we
may need both signals). Notice that we build the ADD cell so that COUT[2] is
available at the top of the datapath when we need it.
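The full-adder equations and the ripple-carry connection can be sketched in Python as follows. This is a bit-level functional model only (no layout or timing); the helper names to_bits, from_bits, full_adder, and ripple_carry_add are hypothetical, and bit lists are ordered with the LSB first so that index i matches stage i.

```python
def full_adder(a, b, cin):
    """One ADD cell: parity for the sum (Eq. 2.38), majority for the carry (Eq. 2.39)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_carry_add(a_bits, b_bits, cin=0):
    """Chain full adders, LSB first; returns the sum bits and COUT of every stage."""
    s_bits, couts = [], []
    carry = cin                      # CIN[0] is tied to VSS ('0') by default
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        s_bits.append(s)
        couts.append(carry)          # COUT[i] becomes CIN[i + 1]
    return s_bits, couts

def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

s_bits, couts = ripple_carry_add(to_bits(6, 4), to_bits(7, 4))
print(from_bits(s_bits), couts)
```

Adding 6 and 7 as unsigned 4-bit numbers leaves COUT[3] at '0' because the result, 13, still fits in four bits.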

Figure 2.20(c) shows a layout of the ADD cell. The A inputs, B inputs, and S outputs
all use m1 interconnect running in the horizontal direction; we call these data signals.
Other signals can enter or exit from the top or bottom and run vertically across the
datapath in m2; we call these control signals. We can also use m1 for control and m2 for
data, but we normally do not mix these approaches in the same structure. Control
signals are typically clocks and other signals common to elements. For example, in
Figure 2.20(c) the carry signals, CIN and COUT, run vertically in m2 between cells. To
build a 4-bit adder we stack four ADD cells creating the array structure shown in
Figure 2.20(d). In this case the A and B data bus inputs enter from the left and bus S,
the sum, exits at the right, but we can connect A, B, and S to either side if we want.

The layout of buswide logic that operates on data signals in this fashion is called a
datapath. The module ADD is a datapath cell or datapath element. Just as we do for
standard cells we make all the datapath cells in a library the same height so we can abut
other datapath cells on either side of the adder to create a more complex datapath.
When people talk about a datapath they always assume that it is oriented so that
increasing the size in bits makes the datapath grow in height, upwards in the vertical
direction, and adding different datapath elements to increase the function makes the
datapath grow in width, in the horizontal direction, but we can rotate and position a
completed datapath in any direction we want on a chip.




FIGURE 2.20 A datapath adder. (a) A full-adder (FA) cell with inputs (A and B), a
carry in, CIN, sum output, S, and carry out, COUT. (b) A 4-bit adder. (c) The layout,
using two-level metal, with data in m1 and control in m2. In this example the wiring is
completed outside the cell; it is also possible to design the datapath cells to contain the
wiring. Using three levels of metal, it is possible to wire over the top of the datapath
cells. (d) The datapath layout.



What is the difference between using a datapath, standard cells, or gate arrays? Cells
are placed together in rows on a CBIC or an MGA, but there is generally no
regularity to the arrangement of the cells within the rows; we let software arrange the
cells and complete the interconnect. Datapath layout automatically takes care of most
of the interconnect between the cells with the following advantages:

• Regular layout produces predictable and equal delay for each bit. 

• Interconnect between cells can be built into each cell. 

There are some disadvantages of using a datapath: 

• The overhead (buffering and routing the control signals, for example) can make a 
narrow (small number of bits) datapath larger and slower than a standard-cell (or 
even gate-array) implementation. 

• Datapath cells have to be predesigned (otherwise we are using full-custom 
design) for use in a wide range of datapath sizes. Datapath cell design can be 
harder than designing gate-array macros or standard cells. 

• Software to assemble a datapath is more complex and not as widely used as 
software for assembling standard cells or gate arrays. 

There are some newer standard-cell and gate-array tools that can take advantage of 
regularity in a design and position cells carefully. The problem is in finding the 
regularity if it is not specified. Using a datapath is one way to specify regularity to 
ASIC design tools. 






2.6.1 Datapath Elements 

Figure 2.21 shows some typical datapath symbols for an adder (people rarely use the
IEEE standards in ASIC datapath libraries). I use heavy lines (they are 1.5 point wide)
with a stroke to denote a data bus (that flows in the horizontal direction in a datapath),
and regular lines (0.5 point) to denote the control signals (that flow vertically in a
datapath). At the risk of adding confusion where there is none, this stroke to indicate a
data bus has nothing to do with mixed-logic conventions. For a bus, A[31:0] denotes a
32-bit bus with A[31] as the leftmost or most-significant bit or MSB, and A[0] as the
least-significant bit or LSB. Sometimes we shall use A[MSB] or A[LSB] to refer to
these bits. Notice that if we have an n-bit bus and LSB = 0, then MSB = n − 1. Also, for
example, A[4] is the fifth bit on the bus (from the LSB). We use a 'Σ' or 'ADD' inside
the symbol to denote an adder instead of '+', so we can attach '−' or '+/−' to the inputs for
a subtracter or adder/subtracter.

FIGURE 2.21 Symbols for a datapath adder. (a) A data bus is shown by a heavy line
(1.5 point) and a bus symbol. If the bus is n bits wide then MSB = n − 1. (b) An
alternative symbol for an adder. (c) Control signals are shown as lightweight (0.5
point) lines.



Some schematic datapath symbols include only data signals and omit the control
signals, but we must not forget them. In Figure 2.21, for example, we may need to
explicitly tie CIN[0] to VSS and use COUT[MSB] and COUT[MSB − 1] to detect
overflow. Why might we need both of these control signals? Table 2.11 shows the
process of simple arithmetic for the different binary number representations, including
unsigned, signed magnitude, ones' complement, and two's complement.

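TABLE 2.11 Binary arithmetic.

Each entry lists the operation followed by its form in each binary number
representation. SG(A) = sign of A; SG(S) = sign of S; OV = overflow; OR = out of
range; COUT is carry out; BOUT is borrow out.

Binary number representation
    Unsigned:          no change
    Signed magnitude:  if positive then MSB = 0 else MSB = 1
    Ones' complement:  if negative then flip bits
    Two's complement:  if negative then {flip bits; add 1}

3 =
    Unsigned:          0011
    Signed magnitude:  0011
    Ones' complement:  0011
    Two's complement:  0011

−3 =
    Unsigned:          NA
    Signed magnitude:  1011
    Ones' complement:  1100
    Two's complement:  1101

zero =
    Unsigned:          0000
    Signed magnitude:  0000 or 1000
    Ones' complement:  1111 or 0000
    Two's complement:  0000

max. positive =
    Unsigned:          1111 = 15
    Signed magnitude:  0111 = 7
    Ones' complement:  0111 = 7
    Two's complement:  0111 = 7

max. negative =
    Unsigned:          0000 = 0
    Signed magnitude:  1111 = −7
    Ones' complement:  1000 = −7
    Two's complement:  1000 = −8

addition = S = A + B = addend + augend
    Unsigned:          S = A + B
    Signed magnitude:  if SG(A) = SG(B) then S = A + B
                       else {if B < A then S = A − B else S = B − A}
    Ones' complement:  S = A + B + COUT[MSB]
    Two's complement:  S = A + B

addition result
    Unsigned:          OR = COUT[MSB]
    Signed magnitude:  if SG(A) = SG(B) then OV = COUT[MSB]
                       else OV = 0 (impossible)
    Ones' complement:  OV = XOR(COUT[MSB], COUT[MSB − 1])
    Two's complement:  OV = XOR(COUT[MSB], COUT[MSB − 1])

SG(S) = sign of S, S = A + B
    Unsigned:          NA
    Signed magnitude:  if SG(A) = SG(B) then SG(S) = SG(A)
                       else {if B < A then SG(S) = SG(A) else SG(S) = SG(B)}
    Ones' complement:  NA
    Two's complement:  NA

subtraction = D = A − B = minuend − subtrahend
    Unsigned:          D = A − B
    Signed magnitude:  SG(B) = NOT(SG(B)); D = A + B
    Ones' complement:  Z = −B (negate); D = A + Z
    Two's complement:  Z = −B (negate); D = A + Z

subtraction result
    Unsigned:          OR = BOUT[MSB]
    Signed magnitude:  as in addition
    Ones' complement:  as in addition
    Two's complement:  as in addition

negation: Z = −A (negate)
    Unsigned:          NA
    Signed magnitude:  Z = A; SG(Z) = NOT(SG(A))
    Ones' complement:  Z = NOT(A)
    Two's complement:  Z = NOT(A) + 1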
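The encodings and the two's complement overflow rule in Table 2.11 can be tried out with a short Python sketch. The function names encode and twos_add_overflow are hypothetical; encode performs no range checking (so the caller must keep values within the range of the chosen representation), and twos_add_overflow takes its operands as n-bit words.

```python
def encode(value, n_bits, representation):
    """Return the n-bit pattern of an integer as a string (no range checking)."""
    mask = (1 << n_bits) - 1
    magnitude = abs(value)
    if representation == "unsigned":
        if value < 0:
            raise ValueError("unsigned numbers cannot be negative")
        word = magnitude                                   # no change
    elif representation == "signed magnitude":
        word = magnitude | ((1 << (n_bits - 1)) if value < 0 else 0)  # set MSB if negative
    elif representation == "ones complement":
        word = (~magnitude & mask) if value < 0 else magnitude        # flip bits
    elif representation == "twos complement":
        word = ((~magnitude + 1) & mask) if value < 0 else magnitude  # flip bits; add 1
    else:
        raise ValueError(f"unknown representation: {representation}")
    return format(word, f"0{n_bits}b")

def twos_add_overflow(a_word, b_word, n_bits):
    """OV = XOR(COUT[MSB], COUT[MSB - 1]) for a ripple-carry addition of two words."""
    carry, couts = 0, []
    for i in range(n_bits):
        a, b = (a_word >> i) & 1, (b_word >> i) & 1
        carry = (a & b) | (a & carry) | (b & carry)
        couts.append(carry)
    return couts[-1] ^ couts[-2]

for rep in ("signed magnitude", "ones complement", "twos complement"):
    print(rep, encode(-3, 4, rep))
print(twos_add_overflow(0b0111, 0b0001, 4))  # 7 + 1 does not fit in 4-bit two's complement
```

The overflow check needs both COUT[MSB] and COUT[MSB − 1], which is why both carries are brought out of the adder.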





2.6.2 Adders 

We can view addition in terms of generate, G[i], and propagate, P[i], signals,

method 1                          method 2
G[i] = A[i] • B[i]                G[i] = A[i] • B[i]                  (2.42)
P[i] = A[i] ⊕ B[i]                P[i] = A[i] + B[i]                  (2.43)
C[i] = G[i] + P[i] • C[i − 1]     C[i] = G[i] + P[i] • C[i − 1]       (2.44)
S[i] = P[i] ⊕ C[i − 1]            S[i] = A[i] ⊕ B[i] ⊕ C[i − 1]       (2.45)

where C[ i ] is the carry-out signal from stage i , equal to the cany in of stage ( i + 1). 
Thus, C[ i ] = COUT[ i ] = CIN[ i + 1], We need to be careful because C[0] might 
represent either the cany in or the cany out of the LSB stage. For an adder we set the 
carry in to the first stage (stage zero), C[l] or CIN[0], to 'O'. Some people use delete 
(D) or kill (K) in various ways for the complements of G[i] and P[i], but unfortunately 
others use C for COUT and D for CINso I avoid using any of these. Do not confuse the 
two different methods (both of which are used) in Eqs. 2.422.45 when forming the 
sum, since the propagate signal, P[ i ] , is different for each method. 
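To see that both definitions of the propagate signal give the same sum, we can model a ripple-carry adder bit by bit. The following Python sketch is an illustration only; the function name `ripple_add` and its `method` argument are assumptions introduced here.

```python
def ripple_add(a, b, n=4, method=1):
    """Add two n-bit numbers stage by stage using generate/propagate signals.
    method=1 uses P = A XOR B; method=2 uses P = A OR B."""
    c = 0                      # C[-1], the carry in to stage zero, set to '0'
    s = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        g = ai & bi                                   # generate
        p = (ai ^ bi) if method == 1 else (ai | bi)   # propagate
        # the sum is formed differently for each method
        si = (p ^ c) if method == 1 else (ai ^ bi ^ c)
        s |= si << i
        c = g | (p & c)                               # C[i] = G[i] + P[i].C[i-1]
    return s, c                # n-bit sum and the carry out of the MSB
```

Using the method 2 propagate signal in the method 1 sum equation would give wrong sums whenever A[i] = B[i] = '1', which is why the two methods must not be mixed.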

Figure 2.22(a) shows a conventional RCA. The delay of an n -bit RCA is proportional 
to n and is limited by the propagation of the carry signal through all of the stages. We 
can reduce delay by using pairs of go-faster bubbles to change AND and OR gates to 
fast two-input NAND gates as shown in Figure 2.22(a). Alternatively, we can write the 
equations for the carry signal in two different ways:

either  $C[i] = A[i] \cdot B[i] + P[i] \cdot C[i-1]$  (2.46)

or  $C[i] = (A[i] + B[i]) \cdot (P[i]' + C[i-1])$,  (2.47)

where P[i]' = NOT(P[i]). Equations 2.46 and 2.47 allow us to build the carry chain
from two-input NAND gates, one per cell, using different logic in even and odd stages
(Figure 2.22b):

even stages                                     odd stages

$C1[i]' = P[i] \cdot C3[i-1] \cdot C4[i-1]$     $C3[i]' = P[i] \cdot C1[i-1] \cdot C2[i-1]$  (2.48)

$C2[i] = A[i] + B[i]$                           $C4[i]' = A[i] \cdot B[i]$  (2.49)

$C[i] = C1[i] \cdot C2[i]$                      $C[i] = C3[i]' + C4[i]'$  (2.50)

(the carry inputs to stage zero are C3[-1] = C4[-1] = '0'). We can use the RCA of
Figure 2.22(b) in a datapath, with standard cells, or on a gate array.

Instead of propagating the carries through each stage of an RCA, Figure 2.23 shows a
different approach. A carry-save adder (CSA) cell CSA(A1[i], A2[i], A3[i], CIN,
S1[i], S2[i], COUT) has three outputs:

$S1[i] = CIN$,  (2.51)

$S2[i] = A1[i] \oplus A2[i] \oplus A3[i] = \mathrm{PARITY}(A1[i], A2[i], A3[i])$,  (2.52)

$COUT = A1[i] \cdot A2[i] + [(A1[i] + A2[i]) \cdot A3[i]] = \mathrm{MAJ}(A1[i], A2[i], A3[i])$.  (2.53)

The inputs, A1, A2, and A3, and outputs, S1 and S2, are buses. The input, CIN, is the
carry from stage (i - 1). The carry in, CIN, is connected directly to the output bus S1,
as indicated by the schematic symbol (Figure 2.23a). We connect CIN[0] to VSS. The
output, COUT, is the carry out to stage (i + 1).

A 4-bit CSA is shown in Figure 2.23(b). The arithmetic overflow signal for ones
complement or twos complement arithmetic, OV, is XOR(COUT[MSB], COUT[MSB - 1])
as shown in Figure 2.23(c). In a CSA the carries are saved at each stage and
shifted left onto the bus S1. There is thus no carry propagation and the delay of a CSA
is constant. At the output of a CSA we still need to add the S1 bus (all the saved
carries) and the S2 bus (all the sums) to get an n-bit result using a final stage that is not
shown in Figure 2.23(c). We might regard the n-bit sum as being encoded in the two
buses, S1 and S2, in the form of the parity and majority functions.
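A word-level Python sketch of Eqs. 2.51-2.53 makes the encoding concrete. The function names `carry_save` and `add4` are illustrative assumptions; the final `+` stands in for whatever carry-propagate adder completes the addition.

```python
def carry_save(a1, a2, a3):
    """One carry-save stage on whole buses: returns (S1, S2)."""
    s2 = a1 ^ a2 ^ a3                      # PARITY, the sum bits
    cout = (a1 & a2) | ((a1 | a2) & a3)    # MAJ, the carry out of each bit
    s1 = cout << 1                         # saved carries, shifted left
    return s1, s2

def add4(a1, a2, a3, a4):
    """Add four inputs with two carry-save stages and one final adder."""
    s1, s2 = carry_save(a1, a2, a3)
    t1, t2 = carry_save(s1, s2, a4)
    return t1 + t2                         # the only carry-propagating step
```

No carry ripples inside `carry_save`; every bit position is computed independently, which is why the delay of the stage does not depend on the word length.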

We can use a CSA to add multiple inputs; as an example, an adder with four 4-bit inputs
is shown in Figure 2.23(d). The last stage sums two input buses using a carry-propagate
adder (CPA). We have used an RCA as the CPA in Figure 2.23(d) and (e), but we can
use any type of adder. Notice in Figure 2.23(e) how the two CSA cells and the RCA
cell abut together horizontally to form a bit slice (or slice) and then the slices are
stacked vertically to form the datapath.




FIGURE 2.23 The carry-save adder (CSA). (a) A CSA cell. (b) A 4-bit CSA.
(c) Symbol for a CSA. (d) A four-input CSA with inputs A1[MSB:0], A2[MSB:0],
A3[MSB:0], and A4[MSB:0]. (e) The datapath for a four-input, 4-bit
adder using CSAs with a ripple-carry adder (RCA) as the final stage. (f) A pipelined
adder. (g) The datapath for the pipelined version showing the pipeline registers as well
as the clock control lines that use m2.

We can register the CSA stages by adding vectors of flip-flops as shown in 
Figure 2.23(f). This reduces the adder delay to that of the slowest adder stage, usually 
the CPA. By using registers between stages of combinational logic we use pipelining to 
increase the speed and pay a price of increased area (for the registers) and introduce 
latency. It takes a few clock cycles (the latency, equal to n clock cycles for an n-stage
pipeline) to fill the pipeline, but once it is filled, the answers emerge every clock cycle.
Ferris wheels work much the same way. When the fair opens it takes a while (latency)
to fill the wheel, but once it is full the people can get on and off every few seconds.

(We can also pipeline the RCA of Figure 2.20. We add i registers on the A and B
inputs before ADD[i] and add (n - i) registers after the output S[i], with a single
register before each C[i].)

The problem with an RCA is that every stage has to wait to make its carry decision,
C[i], until the previous stage has calculated C[i - 1]. If we examine the propagate signals
we can bypass this critical path. Thus, for example, to bypass the carries for bits 4-7
(stages 5-8) of an adder we can compute BYPASS = P[4] · P[5] · P[6] · P[7] and then use a
MUX as follows:

$C[7] = (G[7] + P[7] \cdot C[6]) \cdot BYPASS' + C[3] \cdot BYPASS$.  (2.54)

Adders based on this principle are called carry-bypass adders (CBA) [Sato et al.,
1992]. Large, custom adders employ Manchester-carry chains to compute the carries
and the bypass operation using TGs or just pass transistors [Weste and Eshraghian,
1993, pp. 530-531]. These types of carry chains may be part of a predesigned ASIC
adder cell, but are not used by ASIC designers.
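We can confirm that the bypass MUX of Eq. 2.54 never changes the value of C[7], only how soon it is available. The Python sketch below assumes the method 1 propagate signal (P = A XOR B); the helper names `gp_and_carries` and `bypass_c7` are introduced here for illustration.

```python
def gp_and_carries(a, b, n=8, cin=0):
    """Return lists G[i], P[i] (P = A XOR B), and rippled carries C[i]."""
    g, p, c = [], [], []
    carry = cin                       # C[-1]
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g.append(ai & bi)
        p.append(ai ^ bi)
        carry = g[i] | (p[i] & carry)
        c.append(carry)
    return g, p, c

def bypass_c7(a, b, cin=0):
    """C[7] formed with the bypass MUX of Eq. 2.54."""
    g, p, c = gp_and_carries(a, b, 8, cin)
    bypass = p[4] & p[5] & p[6] & p[7]
    rippled = g[7] | (p[7] & c[6])
    # 2:1 MUX: take C[3] directly when all four stages propagate
    return (rippled & (bypass ^ 1)) | (c[3] & bypass)
```

When BYPASS = '1' all four stages propagate, so the rippled carry and C[3] are equal and the MUX simply selects the signal that arrives first.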

Instead of checking the propagate signals we can check the inputs. For example we can
compute SKIP = (A[i - 1] ⊕ B[i - 1]) · (A[i] ⊕ B[i]) and then use a 2:1 MUX to
select C[i]. Thus,

$CSKIP[i] = (G[i] + P[i] \cdot C[i-1]) \cdot SKIP' + C[i-2] \cdot SKIP$.  (2.55)

This is a carry-skip adder [Keutzer, Malik, and Saldanha, 1991; Lehman, 1961].
Carry-bypass and carry-skip adders may include redundant logic (since the carry is
computed in two different ways; we just take the first signal to arrive). We must be
careful that the redundant logic is not optimized away during logic synthesis.

If we evaluate Eq. 2.44 recursively for i = 1, we get the following:

$C[1] = G[1] + P[1] \cdot C[0]$

$= G[1] + P[1] \cdot (G[0] + P[0] \cdot C[-1])$

$= G[1] + P[1] \cdot G[0]$.  (2.56)

This result means that we can look ahead by two stages and calculate the carry into
the third stage (bit 2), which is C[1], using only the first-stage inputs (to calculate G[0])
and the second-stage inputs. This is a carry-lookahead adder (CLA) [MacSorley,
1961]. If we continue expanding Eq. 2.44, we find:

$C[2] = G[2] + P[2] \cdot G[1] + P[2] \cdot P[1] \cdot G[0]$,

$C[3] = G[3] + P[3] \cdot G[2] + P[3] \cdot P[2] \cdot G[1] + P[3] \cdot P[2] \cdot P[1] \cdot G[0]$.  (2.57)
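The expanded lookahead terms can be checked against ordinary addition. In this Python sketch (the function name `lookahead_carries` is an assumption made for illustration) each carry of a 4-bit adder is written as a flat sum of products of the G and P signals, with the carry in to stage zero set to '0'.

```python
def lookahead_carries(a, b):
    """Carries C[0]..C[3] of a 4-bit adder from flat lookahead equations."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]       # G[i] = A[i].B[i]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]     # P[i] = A[i] XOR B[i]
    c0 = g[0]
    c1 = g[1] | (p[1] & g[0])
    c2 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
    c3 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]))
    return [c0, c1, c2, c3]
```

Each carry depends only on the adder inputs, not on another carry, but the number of product terms and the widest AND both grow by one with every extra bit of lookahead.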

As we look ahead further these equations become more complex, take longer to 
calculate, and the logic becomes less regular when implemented using cells with a 
limited number of inputs. Datapath layout must fit in a bit slice, so the physical and 
logical structure of each bit must be similar. In a standard cell or gate array we are not 
so concerned about a regular physical structure, but a regular logical structure 
simplifies design. The Brent-Kung adder reduces the delay and increases the regularity
of the carry-lookahead scheme [Brent and Kung, 1982]. Figure 2.24(a) shows a regular
4-bit CLA, using the carry-lookahead generator cell (CLG) shown in Figure 2.24(b).














FIGURE 2.24 The Brent-Kung carry-lookahead adder (CLA). (a) Carry generation in a
4-bit CLA. (b) A cell to generate the lookahead terms, C[0]-C[3]. (c) Cells L1, L2, and
L3 are rearranged into a tree that has less delay. Cell L4 is added to calculate C[2] that
is lost in the translation. (d) and (e) Simplified representations of parts a and c. (f) The
lookahead logic for an 8-bit adder. The inputs, 0-7, are the propagate and carry terms
formed from the inputs to the adder. (g) An 8-bit Brent-Kung CLA. The outputs of the
lookahead logic are the carry bits that (together with the inputs) form the sum. One
advantage of this adder is that delays from the inputs to the outputs are more nearly
equal than in other adders. This tends to reduce the number of unwanted and
unnecessary switching events and thus reduces power dissipation.

In a carry-select adder we duplicate two small adders (usually 4-bit or 8-bit adders,
often CLAs) for the cases CIN = '0' and CIN = '1' and then use a MUX to select the
case that we need: wasteful, but fast [Bedrij, 1962]. A carry-select adder is often used as
the fast adder in a datapath library because its layout is regular.

We can use the carry-select, carry-bypass, and carry-skip architectures to split a 12-bit
adder, for example, into three blocks. The delay of the adder is then partly dependent
on the delays of the MUX between each block. Suppose the delay due to 1 bit in an
adder block (we shall call this a bit delay) is approximately equal to the MUX delay. In
this case it may be faster to make the blocks 3, 4, and 5 bits long instead of being equal in
size. Now the delays into the final MUX are equal: 3 bit-delays plus 2 MUX delays for
the carry signal from bits 0-6 and 5 bit-delays for the carry from bits 7-11. Adjusting
the block size reduces the delay of large adders (more than 16 bits).
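A behavioral Python sketch shows how a carry-select adder with unequal blocks is organized. The function name `carry_select_add` and the `blocks` parameter are illustrative assumptions; each block computes both conditional results and the incoming carry acts only as the MUX select.

```python
def carry_select_add(a, b, blocks=(3, 4, 5), cin=0):
    """Add two numbers using carry-select blocks of the given widths
    (LSB block first). Returns the sum and the final carry out."""
    result, shift, carry = 0, 0, cin
    for width in blocks:
        mask = (1 << width) - 1
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        sum0 = x + y           # block result assuming CIN = '0'
        sum1 = x + y + 1       # block result assuming CIN = '1'
        chosen = sum1 if carry else sum0     # the MUX
        result |= (chosen & mask) << shift
        carry = chosen >> width              # carry out selects the next block
        shift += width
    return result, carry
```

With `blocks=(3, 4, 5)` this models the 12-bit example; the sketch checks only the function of the adder and does not model the bit delays or MUX delays.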

We can extend the idea behind a carry-select adder as follows. Suppose we have an
n-bit adder that generates two sums: One sum assumes a carry-in condition of '0', the
other sum assumes a carry-in condition of '1'. We can split this n-bit adder into an i-bit
adder for the i LSBs and an (n - i)-bit adder for the n - i MSBs. Both of the smaller
adders generate two conditional sums as well as true and complement carry signals.

The two (true and complement) carry signals from the LSB adder are used to select
between the two (n - i + 1)-bit conditional sums from the MSB adder using 2(n - i + 1)
two-input MUXes. This is a conditional-sum adder (also often abbreviated to CSA)
[Sklansky, 1960]. We can recursively apply this technique. For example, we can split a
16-bit adder using i = 8 and n - i = 8; then we can split one or both 8-bit adders again,
and so on.

Figure 2.25 shows the simplest form of an n-bit conditional-sum adder that uses n
single-bit conditional adders, H (each with four outputs: two conditional sums, true
carry, and complement carry), together with a tree of 2:1 MUXes (Qi_j). The
conditional-sum adder is usually the fastest of all the adders we have discussed (it is the
fastest when logic cell delay increases with the number of inputs; this is true for all
ASICs except FPGAs).




Ci_j_k = carry in to the ith bit assuming the carry in to the jth bit is k (k = 0 or 1)
Si_j_k = sum at the ith bit assuming the carry in to the jth bit is k (k = 0 or 1)

FIGURE 2.25 The conditional-sum adder. (a) A 1-bit conditional adder that calculates
the sum and carry out assuming the carry in is either '1' or '0'. (b) The multiplexer that
selects between sums and carries. (c) A 4-bit conditional-sum adder with carry input,
C[0].

2.6.3 A Simple Example 

How do we make and use datapath elements? What does a design look like? We may 
use predesigned cells from a library or build the elements ourselves from logic cells 
using a schematic or a design language. Table 2.12 shows an 8-bit conditional-sum 
adder intended for an FPGA. This Verilog implementation uses the same structure as 
Figure 2.25, but the equations are collapsed to use four or five variables. A basic logic 
cell in certain Xilinx FPGAs, for example, can implement two equations of the same 
four variables or one equation with five variables. The equations shown in Table 2.12
require three levels of FPGA logic cells (so, for example, if each FPGA logic cell has
a 5 ns delay, the 8-bit conditional-sum adder delay is 15 ns).

TABLE 2.12 An 8-bit conditional-sum adder (the notation is described in Figure 2.25).

module m8bitCSum (C0, a, b, s, C8); // Verilog conditional-sum adder for an FPGA
  input C0;                 // carry in (a single bit)
  input [7:0] a, b; output [7:0] s; output C8;
  wire A7,A6,A5,A4,A3,A2,A1,A0,B7,B6,B5,B4,B3,B2,B1,B0,S7,S6,S5,S4,S3,S2,S1,S0;
  wire C2, C4_2_0, C4_2_1, S5_4_0, S5_4_1, C6, C6_4_0, C6_4_1;
  assign {A7,A6,A5,A4,A3,A2,A1,A0} = a; assign {B7,B6,B5,B4,B3,B2,B1,B0} = b;
  assign s = { S7,S6,S5,S4,S3,S2,S1,S0 };
  // start of level 1: & = AND, ^ = XOR, | = OR, ! = NOT
  assign S0 = A0 ^ B0 ^ C0 ;
  assign S1 = A1 ^ B1 ^ (A0&B0|(A0|B0)&C0) ;
  assign C2 = A1&B1|(A1|B1)&(A0&B0|(A0|B0)&C0) ;
  assign C4_2_0 = A3&B3|(A3|B3)&(A2&B2) ; assign C4_2_1 = A3&B3|(A3|B3)&(A2|B2) ;
  assign S5_4_0 = A5 ^ B5 ^ (A4&B4) ; assign S5_4_1 = A5 ^ B5 ^ (A4|B4) ;
  assign C6_4_0 = A5&B5|(A5|B5)&(A4&B4) ; assign C6_4_1 = A5&B5|(A5|B5)&(A4|B4) ;
  // start of level 2
  assign S2 = A2 ^ B2 ^ C2 ;
  assign S3 = A3 ^ B3 ^ (A2&B2|(A2|B2)&C2) ;
  assign S4 = A4 ^ B4 ^ (C4_2_0|C4_2_1&C2) ;
  assign S5 = S5_4_0& !(C4_2_0|C4_2_1&C2)|S5_4_1&(C4_2_0|C4_2_1&C2) ;
  assign C6 = C6_4_0|C6_4_1&(C4_2_0|C4_2_1&C2) ;
  // start of level 3
  assign S6 = A6 ^ B6 ^ C6 ;
  assign S7 = A7 ^ B7 ^ (A6&B6|(A6|B6)&C6) ;
  assign C8 = A7&B7|(A7|B7)&(A6&B6|(A6|B6)&C6) ;
endmodule



Figure 2.26 shows the normalized delay and area figures for a set of predesigned 
datapath adders. The data in Figure 2.26 is from a series of ASIC datapath cell libraries 
(Compass Passport) that may be synthesized together with test vectors and simulation 
models. We can combine the different adder techniques, but the adders then lose 
regularity and become less suited to a datapath implementation. 




[Figure: (a) normalized delay (two-input NAND = 1) versus adder size (8, 16, 32, and
64 bits) for ripple-carry, carry-select, and carry-save adders; (b) area versus adder size.]

FIGURE 2.26 Datapath adders. This data is from a series of submicron datapath
libraries. (a) Delay normalized to a two-input NAND logic cell delay (approximately
equal to 250 ps in a 0.5 µm process). For example, a 64-bit ripple-carry adder (RCA)
has a delay of approximately 30 ns in a 0.5 µm process. The spread in delay is due to
variation in delays between different inputs and outputs. An n-bit RCA has a delay
proportional to n. The delay of an n-bit carry-select adder is approximately
proportional to log2 n. The carry-save adder delay is constant (but requires a
carry-propagate adder to complete an addition). (b) In a datapath library the area of all
adders is proportional to the bit size.

There are other adders that are not used in datapaths, but are occasionally useful in 
ASIC design. A serial adder is smaller but slower than the parallel adders we have 
described [Denyer and Renshaw, 1985]. The carry-completion adder is a variable-delay
adder and rarely used in synchronous designs [Sklansky, 1960].

2.6.4 Multipliers 

Figure 2.27 shows a symmetric 6-bit array multiplier (an n-bit multiplier multiplies
two n-bit numbers; we shall use n-bit by m-bit multiplier if the lengths are different).
Adders a0-f0 may be eliminated, which then eliminates adders a1-a6, leaving an
asymmetric CSA array of 30 (5 × 6) adders (including one half adder). An n-bit array
multiplier has a delay proportional to n plus the delay of the CPA (adders b6-f6 in
Figure 2.27). There are two items we can attack to improve the performance of a
multiplier: the number of partial products and the addition of the partial products.




[Figure: the multiplicand A5...A0 and the multiplier B5...B0; each summand
Sij = Ai · Bj, and each row of summands forms one partial product.]

FIGURE 2.27 Multiplication. A 6-bit array multiplier using a final carry-propagate
adder (full-adder cells a6-f6, a ripple-carry adder). Apart from the generation of the
summands this multiplier uses the same structure as the carry-save adder of
Figure 2.23(d).

Suppose we wish to multiply 15 (the multiplicand) by 19 (the multiplier) mentally. It
is easier to calculate 15 × 20 and subtract 15. In effect we complete the multiplication
as 15 × (20 - 1) and we could write this as 15 × $2\bar{1}$, with the overbar representing a
minus sign. Now suppose we wish to multiply an 8-bit binary number, A, by B =
00010111 (decimal 16 + 4 + 2 + 1 = 23). It is easier to multiply A by the canonical
signed-digit vector (CSD vector) D = $0010\bar{1}00\bar{1}$ (decimal 32 - 8 - 1 = 23) since this
requires only three add or subtract operations (and a subtraction is as easy as an
addition). We say B has a weight of 4 and D has a weight of 3. By using D instead of B
we have reduced the number of partial products by 1 (= 4 - 3).

We can recode (or encode) any binary number, B, as a CSD vector, D, as follows
(canonical means there is only one CSD vector for any number):

$D_i = B_i + C_i - 2C_{i+1}$,  (2.58)

where $C_{i+1}$ is the carry from the sum of $B_{i+1} + B_i + C_i$ (we start with $C_0 = 0$).

As another example, if B = 011 ($B_2 = 0$, $B_1 = 1$, $B_0 = 1$; decimal 3), then, using
Eq. 2.58,

$D_0 = B_0 + C_0 - 2C_1 = 1 + 0 - 2 = \bar{1}$,

$D_1 = B_1 + C_1 - 2C_2 = 1 + 1 - 2 = 0$,

$D_2 = B_2 + C_2 - 2C_3 = 0 + 1 - 0 = 1$,  (2.59)

so that D = $10\bar{1}$ (decimal 4 - 1 = 3). CSD vectors are useful to represent fixed
coefficients in digital filters, for example.
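The recoding rule of Eq. 2.58 is short enough to try directly. The Python sketch below is an illustration under two assumptions that are not in the original text: B is treated as an unsigned number, and it is recoded with at least one leading zero (so that the last carry is zero). The name `csd_recode` is hypothetical.

```python
def csd_recode(b, n):
    """Recode an unsigned number b into n signed digits D[0..n-1]
    (LSB first), each -1, 0, or +1, using D_i = B_i + C_i - 2C_{i+1}."""
    def bit(i):
        return (b >> i) & 1
    c = 0                                   # C_0 = 0
    digits = []
    for i in range(n):
        # C_{i+1} is the carry from B_{i+1} + B_i + C_i
        c_next = 1 if bit(i + 1) + bit(i) + c >= 2 else 0
        digits.append(bit(i) + c - 2 * c_next)
        c = c_next
    return digits
```

For B = 00010111 (23) the digits, read from the MSB, are 0, 0, 1, 0, -1, 0, 0, -1, so the weight falls from 4 to 3 as described above.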

We can recode using a radix other than 2. Suppose B is an (n + 1)-digit twos
complement number,

$B = B_0 + B_1 2 + B_2 2^2 + \dots + B_i 2^i + \dots + B_{n-1} 2^{n-1} - B_n 2^n$.  (2.60)

We can rewrite the expression for B using the following sleight-of-hand:

$2B - B = B = -B_0 + (B_0 - B_1) 2 + \dots + (B_{i-1} - B_i) 2^i + \dots + (B_{n-1} - B_n) 2^n$

$= (-2B_1 + B_0) 2^0 + (-2B_3 + B_2 + B_1) 2^2 + \dots + (-2B_i + B_{i-1} + B_{i-2}) 2^{i-1} + \dots + (-2B_n + B_{n-1} + B_{n-2}) 2^{n-1}$.  (2.61)

This is very useful. Consider B = 101001 (decimal 9 - 32 = -23, n = 5),

$B = 101001$

$= (-2B_1 + B_0) 2^0 + (-2B_3 + B_2 + B_1) 2^2 + (-2B_5 + B_4 + B_3) 2^4$

$= ((-2 \times 0) + 1) 2^0 + ((-2 \times 1) + 0 + 0) 2^2 + ((-2 \times 1) + 0 + 1) 2^4$.  (2.62)

Equation 2.61 tells us how to encode B as a radix-4 signed digit, E = $\bar{1}\bar{2}1$ (decimal
-16 - 8 + 1 = -23). To multiply by B encoded as E we only have to perform a multiplication
by 2 (a shift) and three add/subtract operations.

Using Eq. 2.61 we can encode any number by taking groups of three bits at a time and
calculating

$E_j = -2B_i + B_{i-1} + B_{i-2}$,

$E_{j+1} = -2B_{i+2} + B_{i+1} + B_i$, ...,  (2.63)

where each 3-bit group overlaps by one bit. We pad B with a zero, $B_n \dots B_1 B_0 0$, to
match the first term in Eq. 2.61. If B has an odd number of bits, then we extend the
sign: $B_n B_n \dots B_1 B_0 0$. For example, B = 01011 (eleven), encodes to E = $1\bar{1}\bar{1}$ (16 -
4 - 1); and B = 101 is E = $\bar{1}1$. This is called Booth encoding and reduces the number of
partial products by a factor of two and thus considerably reduces the area as well as
increasing the speed of our multiplier [Booth, 1951].
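The grouping of Eq. 2.63 can be sketched in Python as follows. The function name `booth_radix4` is an assumption made for illustration, and the sketch relies on the fact that Python integers behave as sign-extended twos complement numbers under `>>` and `&`, which supplies the sign extension for an odd number of bits.

```python
def booth_radix4(b, nbits):
    """Encode the nbits-wide twos complement number b as radix-4 signed
    digits E (least significant digit first), each in the range -2..+2."""
    if nbits % 2:
        nbits += 1              # odd number of bits: extend the sign
    def bit(i):
        if i < 0:
            return 0            # the padding zero to the right of B_0
        return (b >> i) & 1     # sign-extends automatically for negative b
    # overlapping 3-bit groups: E = -2B[i+1] + B[i] + B[i-1]
    return [-2 * bit(i + 1) + bit(i) + bit(i - 1) for i in range(0, nbits, 2)]
```

For B = 101001 (-23) this returns the digits 1, -2, -1 (least significant first), and weighting digit j by 4 to the power j recovers -23.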

Next we turn our attention to improving the speed of addition in the CSA array. 




Figure 2.28(a) shows a section of the 6-bit array multiplier from Figure 2.27. We can
collapse the chain of adders a0-f5 (5 adder delays) to the Wallace tree consisting of
adders 5.1-5.4 (4 adder delays) shown in Figure 2.28(b).

FIGURE 2.28 Tree-based multiplication. (a) The portion of Figure 2.27 that calculates
the sum bit, P5, using a chain of adders (cells a0-f5), a carry-save chain. (b) We can
collapse this chain to a Wallace tree (cells 5.1-5.5). (c) The stages of multiplication.

Figure 2.28(c) pictorially represents multiplication as a sort of golf course. Each link 
corresponds to an adder. The holes or dots are the outputs of one stage (and the inputs 
of the next). At each stage we have the following three choices: (1) sum three outputs 
using a full adder (denoted by a box enclosing three dots); (2) sum two outputs using a 
half adder (a box with two dots); (3) pass the outputs directly to the next stage. The two 
outputs of an adder are joined by a diagonal line (full adders use black dots, half adders 
white dots). The object of the game is to choose (1), (2), or (3) at each stage to 
maximize the performance of the multiplier. In tree-based multipliers there are two
ways to do this: working forward and working backward.

In a Wallace-tree multiplier we work forward from the multiplier inputs, compressing
the number of signals to be added at each stage [Wallace, 1960]. We can view an FA as
a 3:2 compressor or (3, 2) counter; it counts the number of '1's on the inputs. Thus, for
example, an input of '101' (two '1's) results in an output '10' (2). A half adder is a (2, 2)
counter. To form P5 in Figure 2.29 we must add 6 summands (S05, S14, S23, S32,
S41, and S50) and 4 carries from the P4 column. We add these in stages 1-7,
compressing from 6:3:2:2:3:1:1. Notice that we wait until stage 5 to add the last carry
from column P4, and this means we expand (rather than compress) the number of
signals (from 2 to 3) between stages 3 and 5. The maximum delay through the CSA
array of Figure 2.29 is 6 adder delays. To this we must add the delay of the 4-bit (9
inputs) CPA (stage 7). There are 26 adders (6 half adders) plus the 4 adders in the CPA.




FIGURE 2.29 A 6-bit Wallace-tree multiplier. The carry-save adder (CSA) requires 26
adders (cells 1-26, six are half adders). The final carry-propagate adder (CPA) consists
of 4 adder cells (27-30). The delay of the CSA is 6 adders. The delay of the CPA is 4
adders.

In a Dadda multiplier (Figure 2.30) we work backward from the final product [Dadda,
1965]. Each stage has a maximum of 2, 3, 4, 6, 9, 13, 19, ... outputs (each successive
stage is 3/2 times larger, rounded down to an integer). Thus, for example, in
Figure 2.28(d) we require 3 stages (with 3 adder delays, plus the delay of a 10-bit output
CPA) for a 6-bit Dadda multiplier. There are 20 adders (4 half adders) in the CSA plus
the 10 adders (2 half adders) in the CPA. A Dadda multiplier is usually faster and
smaller than a Wallace-tree multiplier.
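The sequence of stage sizes, and hence the number of carry-save stages a Dadda multiplier needs, follows from the 3/2 rule. A small Python sketch (the function names `dadda_heights` and `dadda_stages` are illustrative assumptions):

```python
def dadda_heights(limit):
    """Maximum stage sizes 2, 3, 4, 6, 9, ... that do not exceed `limit`.
    Each stage is 3/2 times the previous one, rounded down."""
    heights = [2]
    while heights[-1] * 3 // 2 <= limit:
        heights.append(heights[-1] * 3 // 2)
    return heights

def dadda_stages(n):
    """Number of carry-save stages needed to reduce a column of n
    summands to the two rows that feed the final CPA."""
    stages, h = 0, 2
    while h < n:
        h = h * 3 // 2
        stages += 1
    return stages
```

For a 6-bit multiplier the tallest column has 6 summands, which are reduced to 4, then 3, then 2 rows, so `dadda_stages(6)` is 3.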
























FIGURE 2.30 The 6-bit Dadda multiplier. The carry-save adder (CSA) requires 20
adders (cells 1-20, four are half adders). The carry-propagate adder (CPA, cells 21-30)
is a ripple-carry adder (RCA). The CSA is smaller (20 versus 26 adders), faster (3
adder delays versus 6 adder delays), and more regular than the Wallace-tree CSA of
Figure 2.29. The overall speed of this implementation is approximately the same as the
Wallace-tree multiplier of Figure 2.29; however, the speed may be increased by
substituting a faster CPA.

In general, the number of stages and thus delay (in units of an FA delay, excluding the
CPA) for an n-bit tree-based multiplier using (3, 2) counters is

$\log_{1.5} n = \log_{10} n / \log_{10} 1.5 = \log_{10} n / 0.176$.  (2.64)



Figure 2.31(a) shows how the partial-product array is constructed in a conventional
4-bit multiplier. The Ferrari-Stefanelli multiplier (Figure 2.31b) nests multipliers; the
2-bit submultipliers reduce the number of partial products [Ferrari and Stefanelli,
1969].

FIGURE 2.31 Ferrari-Stefanelli multiplier. (a) A conventional 4-bit array multiplier
using AND gates to calculate the summands with (2, 2) and (3, 2) counters to sum the
partial products. (b) A 4-bit Ferrari-Stefanelli multiplier using 2-bit submultipliers to
construct the partial product array. (c) A circuit implementation for an inverting 2-bit
submultiplier.



There are several issues in deciding between parallel multiplier architectures: 

1. Since it is easier to fold triangles rather than trapezoids into squares, a
Wallace-tree multiplier is more suited to full-custom layout, but is slightly larger,
than a Dadda multiplier; both are less regular than an array multiplier. For
cell-based ASICs, a Dadda multiplier is smaller than a Wallace-tree multiplier.

2. The overall multiplier speed does depend on the size and architecture of the final 
CPA, but this may be optimized independently of the CSA array. This means a 
Dadda multiplier is always at least as fast as the Wallace-tree version. 

3. The low-order bits of any parallel multiplier settle first and can be added in the 
CPA before the remaining bits settle. This allows multiplication and the final 
addition to be overlapped in time. 

4. Any of the parallel multiplier architectures may be pipelined. We may also use a 
variably pipelined approach that tailors the register locations to the size of the 
multiplier. 

5. Using (4, 2), (5, 3), (7, 3), or (15, 4) counters increases the stage compression
and permits the size of the stages to be tuned. Some ASIC cell libraries contain a
(7, 3) counter, a 2-bit full adder. A (15, 4) counter is a 3-bit full adder. There is a
trade-off in using these counters between the speed and size of the logic cells and
the delay as well as area of the interconnect.

6. Power dissipation is reduced by the tree-based structures. The simplified 
carry-save logic produces fewer signal transitions and the tree structures produce 
fewer glitches than a chain. 

7. None of the multiplier structures we have discussed take into account the 
possibility of staggered arrival times for different bits of the multiplicand or the 
multiplier. Optimization then requires a logic-synthesis tool. 

2.6.5 Other Arithmetic Systems 

There are other schemes for addition and multiplication that are useful in special 
circumstances. Addition of numbers using redundant binary encoding avoids carry 
propagation and is thus potentially very fast. Table 2.13 shows the rules for addition 
using an intermediate carry and sum that are added without the need for carry. For 
example, 

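| | binary | decimal | redundant binary | CSD vector |
|---|---|---|---|---|
| addend | 1010111 | 87 | $10\bar{1}0\bar{1}00\bar{1}$ | $10\bar{1}0\bar{1}00\bar{1}$ |
| augend | + 1100101 | + 101 | + $1\bar{1}10011\bar{1}$ | + 01100101 |
| intermediate sum | | | $0100\bar{1}\bar{1}10$ | $\bar{1}\bar{1}00\bar{1}100$ |
| intermediate carry | | | $1\bar{1}00010\bar{1}$ | 11000000 |
| sum | = 10111100 | = 188 | $1\bar{1}1000\bar{1}00$ | $10\bar{1}00\bar{1}100$ |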

TABLE 2.13 Redundant binary addition. 
A[i]  B[i]  A[i - 1]  B[i - 1]  sum  carry

1     1     x         x         0    1
The redundant binary representation is not unique. We can represent 101 (decimal), for
example, by 1100101 (binary and CSD vector) or $1\bar{1}10011\bar{1}$. As another example, 188
(decimal) can be represented by 10111100 (binary), $1\bar{1}1000\bar{1}00$, $10\bar{1}00\bar{1}100$, or
$10\bar{1}000\bar{1}00$ (CSD vector). Redundant binary addition of binary, redundant binary, or
CSD vectors does not result in a unique sum, and addition of two CSD vectors does not
result in a CSD vector. Each n-bit redundant binary number requires a rather wasteful
2n-bit binary number for storage. Thus $10\bar{1}$ is represented as 010010, for example
(using sign magnitude). The other disadvantage of redundant binary arithmetic is the
need to convert to and from binary representation.

Table 2.14 shows the (5, 3) residue number system. As an example, 11 (decimal) is
represented as [1, 2] residue (5, 3) since 11 R5 = 11 mod 5 = 1 and 11 R3 = 11 mod 3 =
2. The size of this system is thus 3 × 5 = 15. We add, subtract, or multiply residue
numbers using the modulus of each bit position, without any carry. Thus:

    4     [4, 1]        12     [2, 0]         3     [3, 0]
  + 7   + [2, 1]       - 4   - [4, 1]       × 4   × [4, 1]
 = 11   = [1, 2]       = 8   = [3, 2]      = 12   = [2, 0]



TABLE 2.14 The 5, 3 residue number system.

| n | residue 5 | residue 3 | n | residue 5 | residue 3 | n | residue 5 | residue 3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 5 | 0 | 2 | 10 | 0 | 1 |
| 1 | 1 | 1 | 6 | 1 | 0 | 11 | 1 | 2 |
| 2 | 2 | 2 | 7 | 2 | 1 | 12 | 2 | 0 |
| 3 | 3 | 0 | 8 | 3 | 2 | 13 | 3 | 1 |
| 4 | 4 | 1 | 9 | 4 | 0 | 14 | 4 | 2 |

The choice of moduli determines the system size and the computing complexity. The
most useful choices are relative primes (such as 3 and 5). With p prime, numbers of the
form $2^p$ and $2^p - 1$ are particularly useful ($2^p - 1$ are Mersenne's numbers) [Waser and
Flynn, 1982].
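Residue arithmetic in the (5, 3) system is easy to experiment with. The Python sketch below is illustrative only; the names `MODULI`, `to_residue`, `rns_op`, and `from_residue` are assumptions, and conversion back to an integer is done by a simple search over the 15 values of the system.

```python
import operator

MODULI = (5, 3)   # the (5, 3) residue number system; its size is 15

def to_residue(n):
    """Represent n by its residues, for example 11 -> (1, 2)."""
    return tuple(n % m for m in MODULI)

def rns_op(x, y, op):
    """Apply op (add, subtract, or multiply) to each residue position
    independently; no carry passes between positions."""
    return tuple(op(a, b) % m for a, b, m in zip(x, y, MODULI))

def from_residue(r):
    """Recover the integer in the range 0..14 that has residues r."""
    for n in range(MODULI[0] * MODULI[1]):
        if to_residue(n) == tuple(r):
            return n
    raise ValueError("not a valid residue pair")
```

For example, `rns_op(to_residue(12), to_residue(4), operator.sub)` gives (3, 2), which is 8. Results are only meaningful while the true answer stays within the size of the system (0 to 14 here).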

2.6.6 Other Datapath Operators 

Figure 2.32 shows symbols for some other datapath elements. The combinational 
datapath cells, NAND, NOR, and so on, and sequential datapath cells (flip-flops and 
latches) have standard-cell equivalents and function identically. I use a bold outline (1 
point) for datapath cells instead of the regular (0.5 point) line I use for scalar symbols. 
We call a set of identical cells a vector of datapath elements in the same way that a bold 
symbol, A , represents a vector and A represents a scalar. 





FIGURE 2.32 Symbols for datapath elements. (a) An array or vector of flip-flops (a
register). (b) A two-input NAND cell with databus inputs. (c) A two-input NAND cell
with a control input. (d) A buswide MUX. (e) An incrementer/decrementer. (f) An
all-zeros detector. (g) An all-ones detector. (h) An adder/subtracter.

A subtracter is similar to an adder, except in a full subtracter we have a borrow-in
signal, BIN; a borrow-out signal, BOUT; and a difference signal, DIFF:

$DIFF = A \oplus \mathrm{NOT}(B) \oplus \mathrm{NOT}(BIN) = \mathrm{SUM}(A, \mathrm{NOT}(B), \mathrm{NOT}(BIN))$  (2.65)

$\mathrm{NOT}(BOUT) = A \cdot \mathrm{NOT}(B) + A \cdot \mathrm{NOT}(BIN) + \mathrm{NOT}(B) \cdot \mathrm{NOT}(BIN) = \mathrm{MAJ}(A, \mathrm{NOT}(B), \mathrm{NOT}(BIN))$  (2.66)

These equations are the same as those for the FA (Eqs. 2.38 and 2.39) except that the B
input is inverted and the sense of the carry chain is inverted. To build a subtracter that
calculates (A - B) we invert the entire B input bus and connect the BIN[0] input to
VDD (not to VSS as we did for CIN[0] in an adder). As an example, to subtract B =
'0011' from A = '1001' we calculate '1001' + '1100' + '1' = '0110'. As with an adder, the
true overflow is XOR(BOUT[MSB], BOUT[MSB - 1]).
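The subtracter construction (invert the B bus and tie the input of the carry chain high) can be written as a one-line word-level model. This Python sketch is an illustration; the function name `subtract` is an assumption.

```python
def subtract(a, b, n=4):
    """Compute (A - B) modulo 2**n the way the datapath subtracter does:
    add A, the inverted B bus, and a '1' on the first carry-chain input."""
    mask = (1 << n) - 1
    not_b = ~b & mask          # invert the entire B input bus
    total = a + not_b + 1      # first carry-chain input tied to VDD
    return total & mask        # keep the n-bit difference
```

`subtract(0b1001, 0b0011)` returns `0b0110`, matching the example in the text.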

We can build a ripple-borrow subtracter (a type of borrow-propagate subtracter), a
borrow-save subtracter, and a borrow-select subtracter in the same way we built these
adder architectures. An adder/subtracter has a control signal that gates the A input with
an exclusive-OR cell (forming a programmable inversion) to switch between an adder
or subtracter. Some adder/subtracters gate both inputs to allow us to compute (-A - B).
We must be careful to connect the input to the LSB of the carry chain (CIN[0] or
BIN[0]) when changing between addition (connect to VSS) and subtraction (connect to
VDD).

A barrel shifter rotates or shifts an input bus by a specified amount. For example if we
have an eight-input barrel shifter with input '1111 0000' and we specify a shift of
'0001 0000' (3, coded by bit position) the right-shifted 8-bit output is '0001 1110'. A
barrel shifter may rotate left or right (or switch between the two under a separate 
control). A barrel shifter may also have an output width that is smaller than the input. 
To use a simple example, we may have an 8-bit input and a 4-bit output. This situation 
is equivalent to having a barrel shifter with two 4-bit inputs and a 4-bit output. Barrel 
shifters are used extensively in floating-point arithmetic to align (we call this normalize 
and denormalize ) floating-point numbers (with sign, exponent, and mantissa). 





A leading-one detector is used with a normalizing (left-shift) barrel shifter to align 
mantissas in floating-point numbers. The input is an n -bit bus A, the output is an n -bit 
bus, S, with a single '1' in the bit position corresponding to the most significant '1' in 
the input. Thus, for example, if the input is A = '0000 0101' the leading-one detector 
output is S = '0000 0100', indicating the leading one in A is in bit position 2 (bit 7 is the 
MSB, bit zero is the LSB). If we feed the output, S, of the leading-one detector to the 
shift select input of a normalizing (left-shift) barrel shifter, the shifter will normalize 
the input A. In our example, with an input of A = '0000 0101', and a left-shift of S = 
'0000 0100', the barrel shifter will shift A left by five bits and the output of the shifter is 
Z = '1010 0000'. Now that Z is aligned (with the MSB equal to '1') we can multiply Z 
with another normalized number. 
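
The leading-one detector and normalizing shifter can be sketched behaviorally in Python as follows. The function names and the 8-bit default width are illustrative assumptions; a real datapath cell would implement the same function with a priority chain rather than with integer arithmetic.

```python
def leading_one_detect(a: int, width: int = 8) -> int:
    """Return a one-hot word marking the most significant '1' of a (0 if none)."""
    a &= (1 << width) - 1
    if a == 0:
        return 0
    return 1 << (a.bit_length() - 1)


def normalize(a: int, width: int = 8):
    """Left-shift a until its MSB is '1'; return (shifted value, shift count)."""
    s = leading_one_detect(a, width)
    if s == 0:
        return 0, 0                      # nothing to normalize
    position = s.bit_length() - 1        # bit position of the leading one
    shift = (width - 1) - position       # distance up to the MSB
    return (a << shift) & ((1 << width) - 1), shift


if __name__ == "__main__":
    z, shift = normalize(0b00000101)
    print(format(leading_one_detect(0b00000101), "08b"), format(z, "08b"), shift)
```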

The output of a priority encoder is the binary-encoded position of the leading one in an 
input. For example, with an input A = '0000 0101' the leading '1' is in bit position 2 
(MSB is bit position 7), so the output of a 4-bit priority encoder would be Z = '0010' (2). 
In some cell libraries the encoding is reversed so that the MSB has an output code of 
zero; in this case Z = '0101' (5). This second, reversed, encoding scheme is useful in 
floating-point arithmetic. If A is a mantissa and we normalize A to '1010 0000' we have 
to subtract 5 from the exponent; this exponent correction is equal to the output of the 
priority encoder. 
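
A short Python sketch of both encodings follows. Bit positions are counted from 0 at the LSB, as in the leading-one detector example (bit 7 is the MSB of an 8-bit input). The function name and the reversed_code flag are hypothetical, chosen only for this illustration.

```python
def priority_encode(a: int, width: int = 8, reversed_code: bool = False) -> int:
    """Binary-encoded position of the leading one of a.

    reversed_code=False: LSB has code 0 and the MSB has code width-1.
    reversed_code=True:  MSB has code 0, so the result is also the left shift
                         needed to normalize a (the exponent correction).
    """
    a &= (1 << width) - 1
    if a == 0:
        raise ValueError("input contains no '1'")
    position = a.bit_length() - 1
    return (width - 1 - position) if reversed_code else position


if __name__ == "__main__":
    a = 0b00000101
    correction = priority_encode(a, reversed_code=True)
    print(priority_encode(a), correction, format((a << correction) & 0xFF, "08b"))
```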

An accumulator is an adder/subtracter and a register. Sometimes these are combined 
with a multiplier to form a multiplier-accumulator ( MAC ). An incrementer adds 1 to 
the input bus, Z = A + 1, so we can use this function, together with a register, to negate 
a two's complement number for example. The implementation is Z[ i ] = XOR(A[ i ], 
CIN[ i ]), and COUT[ i ] = AND(A[ i ], CIN[ i ]). The carry-in control input, CIN[0], 
thus acts as an enable: If it is set to '0' the output is the same as the input. 
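
The incrementer equations can be checked with a bit-level Python sketch. The function names and the 8-bit default width are assumptions; negate illustrates the two's complement use mentioned in the text (invert every bit, then increment).

```python
def incrementer(a: int, cin0: int = 1, width: int = 8):
    """Ripple incrementer: Z[i] = XOR(A[i], CIN[i]), COUT[i] = AND(A[i], CIN[i])."""
    carry = cin0                  # CIN[0] acts as an enable
    z = 0
    for i in range(width):
        a_i = (a >> i) & 1
        z |= (a_i ^ carry) << i
        carry = a_i & carry       # COUT[i] becomes CIN[i+1]
    return z, carry


def negate(a: int, width: int = 8) -> int:
    """Two's complement negation: invert the bits, then add 1."""
    mask = (1 << width) - 1
    z, _ = incrementer(~a & mask, 1, width)
    return z


if __name__ == "__main__":
    print(incrementer(0b0101), incrementer(0x3C, cin0=0), negate(5))
```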

The implementation of arithmetic cells is often a little more complicated than we have 
explained. CMOS logic is naturally inverting, so that it is faster to implement an 
incrementer as 

Z[ i (even)] = XOR(A[ i ], CIN[ i ]) and COUT[ i (even)] = NAND(A[ i ], CIN[ i ]). 

This inverts COUT, so that in the following stage we must invert it again. If we push an 
inverting bubble to the input CIN we find that: 

Z[ i (odd)] = XNOR(A[ i ], CIN[ i ]) and COUT[ i (odd)] = NOR(NOT(A[ i ]), CIN[ i ]). 

In many datapath implementations all odd-bit cells operate on inverted carry signals, 
and thus the odd-bit and even-bit datapath elements are different. In fact, all the adder 
and subtracter datapath elements we have described may use this technique. Normally 
this is completely hidden from the designer in the datapath assembly and any output 
control signals are inverted, if necessary, by inserting buffers. 
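
To see that alternating carry polarity gives the same result as the plain incrementer, we can model the even and odd cells separately in Python. This is only a functional sketch under the assumption that even cells receive a true carry and produce an inverted one (NAND), while odd cells receive the inverted carry and restore true polarity (NOR); the function name is hypothetical.

```python
def incrementer_alternating(a: int, width: int = 8) -> int:
    """Incrementer whose carry chain alternates polarity from bit to bit."""
    carry = 1                        # CIN[0] = '1', true polarity, enables the increment
    z = 0
    for i in range(width):
        a_i = (a >> i) & 1
        if i % 2 == 0:
            # Even cell: carry is true on the way in, inverted on the way out.
            z_i = a_i ^ carry                    # XOR
            carry = 1 - (a_i & carry)            # NAND
        else:
            # Odd cell: carry is inverted on the way in, true on the way out.
            z_i = 1 - (a_i ^ carry)              # XNOR
            carry = 1 - ((1 - a_i) | carry)      # NOR(NOT(A), CIN)
        z |= z_i << i
    return z


if __name__ == "__main__":
    print(all(incrementer_alternating(a) == (a + 1) & 0xFF for a in range(256)))
```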

A decrementer subtracts 1 from the input bus; the logical implementation is Z[ i ] = 
XOR(A[ i ], CIN[ i ]) and COUT[ i ] = AND(NOT(A[ i ]), CIN[ i ]). The 
implementation may invert the odd carry signals, with CIN[0] again acting as an 
enable. 

An incrementer/decrementer has a second control input that gates the input, inverting 
the input to the carry chain. This has the effect of selecting either the increment or 
decrement function. 
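
A minimal Python sketch of an incrementer/decrementer, assuming the second control simply XORs each A bit on its way into the carry chain (so the chain sees A when incrementing and NOT(A) when decrementing). The names inc_dec, decrement, and enable are illustrative, not taken from any cell library.

```python
def inc_dec(a: int, decrement: bool, enable: int = 1, width: int = 8) -> int:
    """Increment or decrement a; enable plays the role of CIN[0]."""
    d = 1 if decrement else 0
    carry = enable
    z = 0
    for i in range(width):
        a_i = (a >> i) & 1
        z |= (a_i ^ carry) << i        # Z[i] = XOR(A[i], CIN[i]) in both modes
        carry = (a_i ^ d) & carry      # carry chain sees A[i] or NOT(A[i])
    return z


if __name__ == "__main__":
    print(inc_dec(0x10, decrement=False), inc_dec(0x10, decrement=True))
```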

When using the all-zeros detectors and all-ones detectors, remember that, for a 4-bit number, 
for example, zero in ones' complement arithmetic is '1111' or '0000', and that zero in 
signed magnitude arithmetic is '1000' or '0000'. 

A register file (or scratchpad memory) is a bank of flip-flops arranged across the bus; 
sometimes these have the option of multiple ports (multiport register files) for read and 
write. Normally these register files are the densest logic and hardest to fit in a datapath. 
For large register files it may be more appropriate to use a multiport memory. We can 
add control logic to a register file to create a first-in first-out register ( FIFO ), or last-in 
first-out register ( LIFO ). 

In Section 2.5 we saw that the standard-cell version and gate-array macro version of the 
sequential cells (latches and flip-flops) each contain their own clock buffers. The 
reason for this is that (without intelligent placement software) we do not know where a 
standard cell or a gate-array macro will be placed on a chip. We also have no idea of 
the condition of the clock signal coming into a sequential cell. The ability to place the 
clock buffers outside the sequential cells in a datapath gives us more flexibility and 
saves space. For example, we can place the clock buffers for all the clocked elements at 
the top of the datapath (together with the buffers for the control signals) and river route 
(in river routing the interconnect lines all flow in the same direction on the same layer) 
the connections to the clock lines. This saves space and allows us to guarantee the 
clock skew and timing. It may mean, however, that there is a fixed overhead associated 
with a datapath. For example, it might make no sense to build a 4-bit datapath if the 
clock and control buffers take up twice the space of the datapath logic. Some tools 
allow us to design logic using a portable netlist . After we complete the design we can 
decide whether to implement the portable netlist in a datapath, standard cells, or even a 
gate array, based on area, speed, or power considerations. 




2.7 I/O Cells 



Figure 2.33 shows a three-state bidirectional output buffer (Tri-State ® is a 
registered trademark of National Semiconductor). When the output enable (OE) 
signal is high, the circuit functions as a noninverting buffer driving the value of 
DATAin onto the I/O pad. When OE is low, the output transistors or drivers , M1 
and M2, are disconnected. This allows multiple drivers to be connected on a bus. 
It is up to the designer to make sure that a bus never has two drivers enabled at 
the same time, a problem known as contention . 

In order to prevent the problem opposite to contention (a bus floating to an 
intermediate voltage when there are no bus drivers) we can use a bus keeper or 
bus-hold cell (TI calls this Bus-Friendly logic). A bus keeper normally acts like 
two weak (low drive-strength) cross-coupled inverters that act as a latch to retain 
the last logic state on the bus, but the latch is weak enough that it may be driven 
easily to the opposite state. Even though bus keepers act like latches, and will 
simulate like latches, they should not be used as latches, since their drive strength 
is weak. 

Transistors M1 and M2 in Figure 2.33 have to drive large off-chip loads. If we 
wish to change the voltage on a C = 200 pF load by 5 V in 5 ns (a slew rate of 1 
V ns⁻¹) we will require a current in the output transistors of I_DS = C (dV/dt) = 
(200 × 10⁻¹²)(5/(5 × 10⁻⁹)) = 0.2 A or 200 mA. 

Such large currents flowing in the output transistors must also flow in the power 
supply bus and can cause problems. There is always some inductance in series 
with the power supply, between the point at which the supply enters the ASIC 
package and reaches the power bus on the chip. The inductance is due to the bond 
wire, lead frame, and package pin. If we have a power-supply inductance of 2 nH 
and a current changing from zero to 1 A (32 I/O cells on a bus switching at 30 
mA each) in 5 ns, we will have a voltage spike on the power supply (called 
power-supply bounce ) of L (dI/dt) = (2 × 10⁻⁹)(1/(5 × 10⁻⁹)) = 0.4 V. 
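
The two estimates above (I = C dV/dt for the output driver and V = L dI/dt for the supply) are easy to reproduce numerically. The Python below is a sketch using the values quoted in the text; the helper max_sso, which inverts the bounce formula to give the number of simultaneously switching outputs that fit a noise budget, is a hypothetical addition for illustration.

```python
def drive_current(c_load: float, delta_v: float, delta_t: float) -> float:
    """Current (A) needed to slew a capacitive load: I = C * dV/dt."""
    return c_load * delta_v / delta_t


def supply_bounce(l_supply: float, delta_i: float, delta_t: float) -> float:
    """Voltage spike (V) across the supply inductance: V = L * dI/dt."""
    return l_supply * delta_i / delta_t


def max_sso(l_supply: float, i_per_driver: float, delta_t: float,
            v_budget: float) -> int:
    """Largest number of outputs that may switch together within v_budget."""
    return int(v_budget * delta_t / (l_supply * i_per_driver))


if __name__ == "__main__":
    print(drive_current(200e-12, 5.0, 5e-9))      # 200 pF, 5 V, 5 ns
    print(supply_bounce(2e-9, 1.0, 5e-9))         # 2 nH, 1 A, 5 ns
    print(max_sso(2e-9, 30e-3, 5e-9, 0.4))        # drivers at 30 mA each
```

Halving the slew rate or the number of drivers sharing one supply pad halves the bounce, which is why both appear among the remedies that follow.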

We can do several things to alleviate this problem: We can limit the number of 
simultaneously switching outputs (SSOs), we can limit the number of I/O drivers 
that can be attached to any one VDD and GND pad, and we can design the output 
buffer to limit the slew rate of the output (we call these slew-rate limited I/O 
pads). Quiet-I/O cells also use two separate power supplies and two sets of I/O 
drivers: an AC supply (clean or quiet supply) with small AC drivers for the I/O 
circuits that start and stop the output slewing at the beginning and end of an output 
transition, and a DC supply (noisy or dirty supply) for the transistors that handle 
large currents as they slew the output. 



The three-state buffer allows us to employ the same pad for input and output 
( bidirectional I/O ). When we want to use the pad as an input, we set OE low and 
take the data from DATAin. Of course, it is not necessary to have all these 
features on every pad: We can build output-only or input-only pads. 



FIGURE 2.33 A three-state bidirectional 
output buffer. When the output enable, 
OE, is '1' the output section is enabled 
and drives the I/O pad. When OE is '0' 
the output buffer is placed in a 
high-impedance state. 




We can also use many of these output cell features for input cells that have to 
drive large on-chip loads (a clock pad cell, for example). Some gate arrays 
simply turn an output buffer around to drive a grid of interconnect that supplies a 
clock signal internally. With a typical interconnect capacitance of 0.2 pF cm⁻¹, a 
grid of 100 cm (consisting of 10 by 10 lines running all the way across a 1 cm 
chip) presents a load of 20 pF to the clock buffer. 

Some libraries include I/O cells that have passive pull-ups or pull-downs 
(resistors) instead of the transistors, M1 and M2 (the resistors are normally still 
constructed from transistors with long gate lengths). We can also omit one of the 
driver transistors, M1 or M2, to form open-drain outputs that require an external 
pull-up or pull-down. We can design the output driver to produce TTL output 
levels rather than CMOS logic levels. We may also add input hysteresis (using a 
Schmitt trigger) to the input buffer, I1 in Figure 2.33, to accept input data signals 
that contain glitches (from bouncing switch contacts, for example) or that are 
slow rising. The input buffer can also include a level shifter to accept TTL input 
levels and shift the input signal to CMOS levels. 

The gate oxide in CMOS transistors is extremely thin (100 Å or less). This leaves 
the gate oxide of the I/O cell input transistors susceptible to breakdown from 
static electricity ( electrostatic discharge , or ESD ). ESD arises when we or 
machines handle the package leads (like the shock I sometimes get when I touch 
a doorknob after walking across the carpet at work). Sometimes this problem is 
called electrical overstress (EOS) since most ESD-related failures are caused not 
by gate-oxide breakdown, but by the thermal stress (melting) that occurs when 
the n -channel transistor in an output driver overheats (melts) due to the large 
current that can flow in the drain diffusion connected to a pad during an ESD 
event. 




To protect the I/O cells from ESD, the input pads are normally tied to device 
structures that clamp the input voltage to below the gate breakdown voltage 
(which can be as low as 10 V with a 100 Å gate oxide). Some I/O cells use 
transistors with a special ESD implant that increases breakdown voltage and 
provides protection. I/O driver transistors can also use elongated drain structures 
(ladder structures) and large drain-to-gate spacing to help limit current, but in a 
salicide process that lowers the drain resistance this is difficult. One solution is to 
mask the I/O cells during the salicide step. Another solution is to use pnpn and 
npnp diffusion structures called silicon-controlled rectifiers (SCRs) to clamp 
voltages and divert current to protect the I/O circuits from ESD. 

There are several ways to model the capability of an I/O cell to withstand EOS. 
The human-body model ( HBM ) represents ESD by a 100 pF capacitor 
discharging through a 1.5 kΩ resistor (this is an International Electrotechnical 
Commission, IEC, specification). Typical voltages generated by the human body 
are in the range of 2–4 kV, and we often see an I/O pad cell rated by the voltage it 
can withstand using the HBM. The machine model ( MM ) represents an ESD 
event generated by automated machine handlers. Typical MM parameters use a 
200 pF capacitor (typically charged to 200 V) discharged through a 25 Ω 
resistor, corresponding to a peak initial current of nearly 10 A. The charge-device 
model ( CDM , also called device charge-discharge) represents the problem when 
an IC package is charged, in a shipping tube for example, and then grounded. If 
the maximum charge on a package is 3 nC (a typical measured figure) and the 
package capacitance to ground is 1.5 pF, we can simulate this event by charging a 
1.5 pF capacitor to 2 kV and discharging it through a 1 Ω resistor. 
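
The figures quoted for the three ESD models follow from two simple relations: the peak initial current of an ideal RC discharge is V/R, and the voltage on a charged package is Q/C. The Python sketch below assumes an ideal discharge with no series inductance; the function names are illustrative. For the machine model it gives 200 V / 25 Ω = 8 A, which the text rounds to nearly 10 A.

```python
def peak_current(v_charge: float, r_series: float) -> float:
    """Peak initial current (A) of an ideal RC discharge: I = V / R."""
    return v_charge / r_series


def time_constant(r_series: float, c_model: float) -> float:
    """RC time constant (s) of the discharge."""
    return r_series * c_model


def package_voltage(q_package: float, c_package: float) -> float:
    """Voltage (V) on a charged package: V = Q / C."""
    return q_package / c_package


if __name__ == "__main__":
    print(peak_current(200.0, 25.0))             # machine model
    print(package_voltage(3e-9, 1.5e-12))        # charge-device model
    print(time_constant(1.5e3, 100e-12))         # human-body model
    print(peak_current(2000.0, 1.5e3))           # HBM at 2 kV
```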

If the diffusion structures in the I/O cells are not designed with care, it is possible 
to construct an SCR structure unwittingly, and instead of protecting the 
transistors the SCR can enter a mode where it is latched on and conducting large 
enough currents to destroy the chip. This failure mode is called latch-up . 

Latch-up can occur if the pn -diodes on a chip become forward-biased and inject 
minority carriers (electrons in p -type material, holes in n -type material) into the 
substrate. The source-substrate and drain-substrate diodes can become 
forward-biased due to power-supply bounce or output undershoot (the cell 
outputs fall below V_SS ) or overshoot (outputs rise to greater than V_DD ) for 
example. These injected minority carriers can travel fairly large distances and 
interact with nearby transistors causing latch-up. I/O cells normally surround the 
I/O transistors with guard rings (a continuous ring of n -diffusion in an n -well 
connected to VDD, and a ring of p -diffusion in a p -well connected to VSS) to 
collect these minority carriers. This is a problem that can also occur in the logic 
core and this is one reason that we normally include substrate and well 
connections to the power supplies in every cell. 




2.8 Cell Compilers 

The process of hand crafting circuits and layout for a full-custom IC is a tedious, 
time-consuming, and error-prone task. There are two types of automated layout 
assembly tools, often known as silicon compilers . The first type produces a 
specific kind of circuit, a RAM compiler or multiplier compiler , for example. 

The second type of compiler is more flexible, usually providing a programming 
language that assembles or tiles layout from an input command file, but this is 
full-custom IC design. 

We can build a register file from latches or flip-flops, but, at 4.5–6.5 gates (18–26 
transistors) per bit, this is an expensive way to build memory. Dynamic RAM 
(DRAM) can use a cell with only one transistor, storing charge on a capacitor 
that has to be periodically refreshed as the charge leaks away. ASIC RAM is 
invariably static (SRAM), so we do not need to refresh the bits. When we refer to 
RAM in an ASIC environment we almost always mean SRAM. Most ASIC 
RAMs use a six-transistor cell (four transistors to form two cross-coupled 
inverters that form the storage loop, and two more transistors to allow us to read 
from and write to the cell). RAM compilers are available that produce single-port 
RAM (a single shared bus for read and write) as well as dual-port RAMs , and 
multiport RAMs . In a multi-port RAM the compiler may or may not handle the 
problem of address contention (attempts to read and write to the same RAM 
address simultaneously). RAM can be asynchronous (the read and write cycles 
are triggered by control and/or address transitions asynchronous to a clock) or 
synchronous (using the system clock). 

In addition to producing layout we also need a model compiler so that we can 
verify the circuit at the behavioral level, and we need a netlist from a netlist 
compiler so that we can simulate the circuit and verify that it works correctly at 
the structural level. Silicon compilers are thus complex pieces of software. We 
assume that a silicon compiler will produce working silicon even if every 
configuration has not been tested. This is still ASIC design, but now we are 
relying on the fact that the tool works correctly and therefore the compiled blocks 
are correct by construction . 




2.9 Summary 

The most important concepts that we covered in this chapter are the following: 

• The use of transistors as switches 

• The difference between a flip-flop and a latch 

• The meaning of setup time and hold time 

• Pipelines and latency 

• The difference between datapath, standard-cell, and gate-array logic cells 

• Strong and weak logic levels 

• Pushing bubbles 

• Ratio of logic 

• Resistance per square of layers and their relative values in CMOS 

• Design rules and λ 




2.10 Problems 



* = Difficult, ** = Very difficult, *** = Extremely difficult 

2.1 (Switches, 20 min.) (a) Draw a circuit schematic for a two-way light switch: 
flipping the switch at the top or bottom of the stairs reverses the state of two light 
bulbs, one at the top and one at the bottom of the stairs. Your schematic should 
show and label all the cables, switches, and bulbs. (b) Repeat the problem for 
three switches and one light in a warehouse. 

2.2 (Logic, 10 min.) The queen wished to choose her successor wisely. She 
blindfolded and then placed a crown on each of her three children, explaining that 
there were three red and two blue crowns, and they must deduce the color of their 
own crown. With blindfolds removed the children could see the two other 
crowns, but not their own. After a while Anne said: "My crown is red." How did 
she know? 

2.3 (Minus signs, 20 min.) The channel charge in an n -channel transistor is 
negative. (a) Should there not be a minus sign in Eq. 2.5 to account for this? (b) 
If so, then where in the derivation of Section 2.1 does the minus sign disappear 
to arrive at Eq. 2.9 for the current in an n -channel transistor? (c) The equations 
for the current in a p -channel transistor (Eq. 2.15) have the opposite sign to those 
for an n -channel transistor. Where in the derivation in Section 2.1 does the extra 
minus sign arise? 



FIGURE 2.34 Transistor characteristics for a 
0.3 μm process (Problem 2.4). 




2.4 (Transistor curves, 20 min.) Figure 2.34 shows the measured I_DS–V_DS 
characteristics for a 20/20 n -channel transistor in a 0.3 μm (effective gate 
length) process from an ASIC foundry. Derive as much information as you can 
from this figure. 

2.5 (Body effect, 20 min). The equations for the drain-source current (2.9, 2.12, 
and 2.15) do not contain V_SB , the source voltage with respect to the bulk, 
because we assumed that it was zero. This is not true for the n -channel transistor 
whose drain is connected to the output in a two-input NAND gate, for example. 

A reverse substrate bias (or back-gate bias; V_SB > 0 for an n -channel transistor) 
makes the bulk act like a second gate (the back gate) and modifies an n -channel 
transistor threshold voltage as follows: 

V_tn = V_t0n + γ [ √(φ_0 + V_SB) − √(φ_0) ] , (2.67) 

where V_t0n is measured with V_SB = 0 V; φ_0 is called the surface potential; and 
γ (gamma) is the body-effect coefficient (back-gate bias coefficient), 

γ = √(2 q ε_Si N_A) / C_ox . (2.68) 

There are several alternative names and symbols for φ_0 (phi, a positive quantity 
for an n -channel transistor, typically between 0.6–0.7 V); you may also see φ_b (for 
bulk potential) or 2φ_F (twice the Fermi potential, a negative quantity). In 
Eq. 2.68, ε_Si = ε_0 ε_r = 1.035 × 10⁻¹⁰ F m⁻¹ is the permittivity of silicon (the 
permittivity of a vacuum ε_0 = 8.85 × 10⁻¹² F m⁻¹ and the relative permittivity of 
silicon is ε_r = 11.7); N_A is the acceptor doping concentration in the bulk (for p 
-type substrate or well; N_D is the donor concentration in an n -type substrate or 
well); and C_ox is the gate capacitance per unit area given by 

C_ox = ε_ox / T_ox . (2.69) 

• a. Calculate the theoretical value of γ for N_A = 10¹⁶ cm⁻³, T_ox = 100 Å. 

• b. Calculate and plot V_tn for V_SB ranging from 0 V to 5 V in increments 
of 1 V assuming values of γ = 0.5 V^0.5, φ_0 = 0.6 V, and V_t0n = 0.5 V 
obtained from transistor characteristics. 

• c. Fit a linear approximation to V_tn . 

• d. Recognizing V_SB ≤ 0 V, rewrite Eq. 2.67 for a p -channel device. 

• e. (Harder) What effect does the back-gate bias effect have on CMOS logic 
circuits? 

Answer: (a) 0.17 V^0.5 (b) 0.50–1.3 V. 
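
The answers to parts (a) and (b) can be checked numerically. The Python sketch below evaluates Eqs. 2.67–2.69 in SI units; the electron charge (1.602 × 10⁻¹⁹ C) and a relative permittivity of 3.9 for the gate oxide are assumed values that are not stated in this problem, and the function names are illustrative.

```python
import math

Q_ELECTRON = 1.602e-19   # C (assumed value)
EPS_0 = 8.85e-12         # F/m, permittivity of a vacuum (from the text)
EPS_R_SI = 11.7          # relative permittivity of silicon (from the text)
EPS_R_OX = 3.9           # relative permittivity of SiO2 (assumed value)


def body_effect_coefficient(n_a_per_cm3: float, t_ox_angstrom: float) -> float:
    """Eq. 2.68: gamma = sqrt(2 q eps_Si N_A) / C_ox, in V^0.5."""
    n_a = n_a_per_cm3 * 1e6                   # cm^-3 -> m^-3
    t_ox = t_ox_angstrom * 1e-10              # angstrom -> m
    c_ox = EPS_0 * EPS_R_OX / t_ox            # Eq. 2.69, F/m^2
    return math.sqrt(2 * Q_ELECTRON * EPS_0 * EPS_R_SI * n_a) / c_ox


def threshold_voltage(v_sb: float, v_t0n: float = 0.5,
                      gamma: float = 0.5, phi_0: float = 0.6) -> float:
    """Eq. 2.67: V_tn as a function of the source-bulk voltage."""
    return v_t0n + gamma * (math.sqrt(phi_0 + v_sb) - math.sqrt(phi_0))


if __name__ == "__main__":
    print(round(body_effect_coefficient(1e16, 100), 2))
    for v_sb in range(6):
        print(v_sb, round(threshold_voltage(v_sb), 2))
```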

2.6 (Sizing layout, 10 min.) Stating clearly whatever assumptions you make and 
describing the tools and methods you use, estimate the size (in λ) of the standard 
cell shown in Figure 1.3. Estimate the size of each of the transistors, giving their 
channel lengths and widths (stating clearly which is which). 

2.7 (CMOS process) (20 min.) Table 2.15 shows the major steps involved in a 
typical deep submicron CMOS process. There are approximately 100 major steps 
in the process. 

• a. If each major step has a yield of 0.9, what is the overall process yield? 

• b. If the process yield is 90 % (not uncommon), what is the average yield 
at each major step? 

• c. If each of the major steps in Table 2.15 consists of an average of five 
other microtasks, what is the average yield of each of the 500 microtasks? 

• d. Suppose, for example, an operator loads and unloads a furnace five 
times a day as a microtask; how many days must the operator work without 
making a mistake to achieve this microtask yield? 

• e. Does this seem reasonable? What is wrong with our model? 

• f. (**60 min.) Draw the process cross-section showing, in particular, the 
poly, FOX, gate oxide, IMOs and metal layers. You may have to make 
some assumptions about the meanings and functions of the various steps 
and layers. Assume all layers are deposited on top of each other according 
to the thicknesses shown (do not attempt to correct for the silicon 
consumed during oxidation, even if you understand what this means). The 
abbreviations in Table 2.15 are as follows: dep. = deposition; LPCVD = 
low-pressure chemical vapor deposition (for growing oxide and poly); 
LDD = lightly doped drain (a way to improve transistor characteristics); 
SOG = silicon overglass (a deposited quartz to help with step coverage 
between metal layers). 



TABLE 2.15 CMOS process steps (Problem 2.7). 

Step                        Depth   Step                        Depth   Step                        Depth
 1  substrate                       32  resist strip                    63  m1 mask
 2  oxide 1 dep.              500   33  WSi anneal                      64  m1 etch
 3  nitride 1 dep.           1500   34  nLDD mask                       65  resist strip
 4  n-well mask                     35  nLDD implant                    66  base oxide dep.          6000
 5  n-well etch                     36  resist strip                    67  SOG coat 1/2             3000
 6  n-well implant                  37  pLDD mask                       68  SOG cure/etch            4000
 7  resist strip                    38  pLDD implant                    69  cap oxide dep.           4000
 8  blocking oxide dep.      2000   39  resist strip                    70  via1 mask
 9  nitride 1 strip                 40  spacer oxide dep.        3000   71  via1 etch                2500
10  p-well implant                  41  WSi anneal                      72  resist strip
11  p-well drive                    42  SD oxide dep.             200   73  TiW dep.                 2000
12  active oxide dep.         250   43  n+ mask                         74  AlCu/TiW dep.            4000
13  nitride 2 dep.           1500   44  n+ implant                      75  m2 mask
14  active mask                     45  resist strip                    76  m2 etch
15  active etch                     46  ESD mask                        77  resist strip
16  resist strip                    47  ESD implant                     78  base oxide dep.          6000
17  field mask                      48  resist strip                    79  SOG coat 1/2             3000
18  field implant                   49  p+ mask                         80  SOG cure/etch            4000
19  resist strip                    50  p+ implant                      81  cap oxide dep.           4000
20  field oxide dep.         5000   51  resist strip                    82  via2 mask
21  nitride 2 strip                 52  implant anneal                  83  via2 etch                2500
22  sacrificial oxide dep.    300   53  LPCVD oxide dep.         1500   84  resist strip
23  Vt adjust implant               54  BPSG dep./densify        4000   85  TiW dep.                 2000
24  gate oxide dep.            80   55  contact mask                    86  AlCu/TiW dep.            4000
25  LPCVD poly dep.          1500   56  contact etch             2500   87  m3 mask
26  deglaze                         57  resist strip                    88  m3 etch
27  WSi dep.                 1500   58  Pt dep.                   200   89  resist strip
28  LPCVD oxide dep.          750   59  Pt sinter                       90  oxide dep.               4000
29  poly mask                       60  Pt strip                        92  nitride dep.           10,000
30  oxide etch                      61  TiW dep.                 2000   93  pad mask
31  polycide etch                   62  AlCu/TiW dep.            4000   94  pad etch

Answer: (a) Zero. (b) 0.999. (c) 0.9998. (d) 3 years. 
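
The yield arithmetic behind these answers can be sketched in Python. The model assumes that every step has the same yield and that steps fail independently, so yields multiply; for part (d) we further assume that a yield y corresponds to one mistake in 1/(1 − y) operations. Both are simplifying assumptions made for this sketch, and the function names are illustrative.

```python
def overall_yield(step_yield: float, n_steps: int) -> float:
    """Yield of n independent steps that each have the same yield."""
    return step_yield ** n_steps


def per_step_yield(process_yield: float, n_steps: int) -> float:
    """Average step yield needed to reach a given overall yield."""
    return process_yield ** (1.0 / n_steps)


def days_without_mistake(task_yield: float, tasks_per_day: int) -> float:
    """Days between mistakes if one task in 1/(1 - yield) goes wrong."""
    return 1.0 / ((1.0 - task_yield) * tasks_per_day)


if __name__ == "__main__":
    print(overall_yield(0.9, 100))                  # part (a): effectively zero
    print(round(per_step_yield(0.9, 100), 3))       # part (b)
    y_micro = per_step_yield(0.9, 500)
    print(round(y_micro, 4))                        # part (c)
    print(days_without_mistake(y_micro, 5) / 365)   # part (d), in years
```

Under these assumptions part (d) comes out at several hundred days, the same order as the three years quoted in the answer.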

2.8 (Stipple patterns, 30 min.) 

• a. Check the stipple patterns in Figure 2.9. Using ruled paper draw 8-by-8 
stipple patterns for all the combinations of layers shown. 

• b. Repeat part a for Figure 2.10. 

2.9 (Select, 20 min.) Can you draw a design-rule correct (according to the design 
rules in Tables 2.7–2.9) layout with a piece of select that has a minimum width of 
2 λ (rule 4.4)? 




2.10 (*Inverter layout, 60 min.) Using 1/4-inch ruled paper (or similar) draw a 
minimum-size inverter (W/L = 1 for both p -channel and n -channel transistors). 
Use a scale of one square to 2 λ and the design rules in Tables 2.7–2.9. Do 
not use m2 or m3, only m1. Draw the nwell, pwell, ndiff, and pdiff layers, but not 
the implant layers or the active layer. Include connections to the input, output, 
VDD, and VSS in m1. There must be at least one well connection to each well ( n 
-well to VDD, and p -well to VSS). Minimize the size of your cell BB. Draw the 
BB outline and write its size in λ² on your drawing. Use green diagonal stripes 
for ndiff, brown diagonal stripes for pdiff, red diagonal stripes for poly, blue 
diagonal stripes for m1, and solid black for contact. Include a key on your drawing, 
and clearly label the input, output, VDD, and VSS contacts. 

2.11 (*AOI221 Layout, 120 min.) Lay out the AOI221 shown in Figure 2.13 with 
the design rules of Tables 2.7–2.9 and using Figure 1.3 as a guide. Label clearly 
the m1 corresponding to the inputs, output, VDD bus, and GND (VSS) bus. 
Remember to include substrate contacts. What is the size of your BB in λ²? 

2.12 (Resistance, 20 min.) 

• a. Using the values for sheet resistance shown in Table 2.3, calculate the 
resistance of a 200 λ long (in the direction of current flow) by 3 λ wide 
piece of each of the layers. 

• b. Estimate the resistance of an 8-inch, 10 Ω cm, p -type, <100> wafer, 
measured (i) from edge to edge across a diameter and (ii) from face center 
to the face center on the other side. 

2.13 (*Layout graphics, 120 min.) Write a tutorial for capturing layout. As an 
example: 

To capture EPSF (encapsulated PostScript format) from Tanner Research's 
L-Edit for documentation, Macintosh version... Create a black-and-white 
technology file, use Setup, Layers..., in L-Edit. The method described here does 
not work well for grayscale or color. Use File, Print..., Destination check button 
File to print from L-Edit to an EPS (encapsulated PostScript) file. After you 
choose Save, a dialog box appears. Select Format: EPS Enhanced Mac Preview, 
ASCII, Level 1 Compatible, Font Inclusion: None. Save the file. Switch to 
Frame. Create an Anchored Frame. Use File, Import, File... to bring up a dialog 
box. Check button Copy into Document, select Format: EPSF. Import the EPS 
file that will appear as a page image. Grab the graphic inside the Anchored 
Frame and move the page image around. There will be a footer with text on the 
page image that you may want to hide by using the Anchored Frame edges to 
crop the image. 

Your instructions should be precise, concise, assume nothing, and use the names 
of menu items, buttons and so on exactly as they appear to the user. Most of the 
layout figures in this book were created using L-Edit running on a Macintosh, 
with labels added in FrameMaker. Most of the layouts use the Compass layout 
editor. 




2.14 (Transistor resistance, 20 min.) Calculate I_DS and the resistance (the DC 
value V_DS / I_DS as well as the AC value ∂V_DS / ∂I_DS as appropriate) of 
long-channel transistors with the following parameters, under the specified 
conditions. In each case state whether the transistor is in the saturation region, 
linear region, or off: 

(i) n -channel: V_tn = 0.5 V, β_n = 40 μA V⁻²: 

V_GS = 3.3 V: a. V_DS = 3.3 V b. V_DS = 0.0 V c. V_GS = 0.0 V, V_DS = 3.3 V 

(ii) p -channel: V_tp = −0.6 V, β_p = 20 μA V⁻²: 

V_GS = 0.0 V: a. V_DS = 0.0 V b. V_DS = −5.0 V c. V_GS = −5.0 V, V_DS = −5.0 V 

2.15 (Circuit theory, 15 min.) You accidentally created the inverter shown in 
Figure 2.35 on a full-custom ASIC currently being fabricated. Will it work? Your 
manager wants a yes or no answer. Your group is a little more understanding: 

You are to make a presentation to them to explain the problems ahead. Prepare 
two foils as well as a one page list of alternatives and recommendations. 



FIGURE 2.35 A CMOS inverter with n -channel and p 
-channel transistors swapped (Problem 2.15). 




2.16 (Mask resolution, 10 min.) People use LaserWriters to make printed-circuit 
boards all the time. 

• a. Do you think it is possible to make an IC mask using a 600 dpi (dots per 
inch) LaserWriter and a transparency? 

• b. What would λ be? 

• c. (Harder) See if you can use a microscope to look at the dot and the 
rectangular bars (serifs) of a letter 'i' from the output of a LaserWriter on 
paper (most are 300 dpi or 600 dpi). Estimate λ . What is causing the 
problem? Why is there no rush to generate 1200 dpi LaserWriters for 
paper? Put a page of this textbook under the microscope: can you see the 
difference? What are the similar problems printing patterns on a wafer? 

2.17 (Lambda, 10 min.) Estimate λ 

• a. for your TV screen, 

• b. for your computer monitor, 

• c. (harder) a photograph. 

2.18 (Pass-transistor logic, 10 min.) 




• a. In Figure 2.36 suppose we set A = B = C = D = '1', what is the value of 
F? 

• b. What is the logic strength of the signal at F? 

• c. If V_DD = 5 V and V_tn = 0.6 V, what would the voltage at the source 
and drain terminals of M1, M2, and M3 be? 

• d. Will this circuit still work if V_DD = 3 V? 

• e. At what point does it stop working? 



FIGURE 2.36 A pass transistor chain (Problem 
2.18). 

2.19 (Transistor parameters, 20 min.) Calculate the (a) electron and (b) hole 
mobility for the transistor parameters given in Section 2.1 if k'_n = 80 μA V⁻² 
and k'_p = 40 μA V⁻². 

Answer: (a) 0.023 m² V⁻¹ s⁻¹. 

2.20 (Quantum behavior, 10 min.) The average thermal energy of an electron is 
approximately kT , where k = 1.38 × 10⁻²³ J K⁻¹ is Boltzmann's constant and T is 
the absolute temperature in kelvin. 

• a. The kinetic energy of an electron is (1/2) m v², where v is due to 
random thermal motion, and m = 9.11 × 10⁻³¹ kg is the rest mass. What is 
v at 300 K? 

• b. The electron wavelength λ = h / p , where h = 6.62 × 10⁻³⁴ J s is the 
Planck constant, and p = m v is the electron momentum. What is λ at 25 
°C? 

• c. Compare the thermal velocity with the saturation velocity. 

• d. Compare the electron wavelength with the MOS channel length and 
with the gate-oxide thickness in a 0.25 μm process and a 0.1 μm process. 

2.21 (Gallium arsenide, 5 min.) The electron mobility in GaAs is about 8500 cm² 
V⁻¹ s⁻¹; the hole mobility is about 400 cm² V⁻¹ s⁻¹. If we could make 
complementary n -channel and p -channel GaAs transistors (the same way that 
we do in a CMOS process) what would the ratio of a GaAs inverter be to equalize 
rise and fall times? About how much faster would you expect GaAs transistors to 
be than silicon for the same transistor sizes? 

2.22 (Margaret of Anjou, 5 min.) 

• a. Why is it ones' complement but two's complement? 

• b. Why Queen's University, Belfast but Queens' College, Cambridge? 




2.23 (Logic cell equations, 5 min.) Show that Eq. 2.31, 2.36, and 2.37 are correct. 

2.24 (Carry-lookahead equations, 10 min.) 

• a. Derive the carry-lookahead equations for i = 8. Write them in the same 
form as Eq. 2.56. 

• b. Derive the equations for the Brent-Kung structure for i = 8. 

2.25 (OAI cells, 20 min.) Draw a circuit schematic, including transistor sizes, for 
(a) an OAI321 cell, (b) an AOI321 cell, (c) Which do you think will be larger? 

2.26 (**Making stipple patterns) Construct a set of black-and-white, transparent, 
8-by-8 stipple patterns for a CMOS process in which we draw both well layers, 
the active layer, poly, and both diffusion implant layers separately. Consider only 
the layers up to m1 (but include m1 and the contact layer). One useful tool is the 
Apple Macintosh Control Panel, 'General Controls,' that changes the Mac desktop 
pattern. 

• a. (60 min.) Create a set of patterns with which you can detect any errors 
(for example, n -well and p -well overlap, or n -implant and p -implant 
overlap). 

• b. (60 min.+) Using a layout of an inverter as an example, find a set of 
patterns that allows you to trace transistors and connections (a very 
qualitative goal). 

• c. (Days+) Find a set of grayscale stipple patterns that allow you to 
produce layouts that look nice in a report (much, much harder than it 
sounds). 

2.27 (AOI and OAI cells, 10 min.) Draw the circuit schematics for an AOI22 and 
an OAI22 cell. Clearly label each transistor as on or off for each cell for an input 
vector of (A1, A2, B1, B2) = (0101). 

2.28 (Flip-flops and latches, 10 min.) In no more than 20 words describe the 
difference between a flip-flop and a latch. 

2.29 (**An old argument) Should setup and hold times appear under maximum, 
minimum, or typical in a data sheet? (From Peter Alfke.) 

2.30 (***Setup, 20 min.) "There is no such thing as a setup and hold time, just 
two setup times: for a '1' and for a '0'." Comment. (From Clemenz Portmann.) 

2.31 (Subtracter, 20 min.) Show that you can rewrite the equations for a full 
subtracter (Eqs. 2.65-2.66) to be the same as a full adder, except that A is inverted 
in the borrow out equation, as follows: 

$$\mathrm{DIFF} = A \oplus B \oplus \mathrm{BIN} = \mathrm{SUM}(A, B, \mathrm{BIN}), \tag{2.70}$$

$$\mathrm{BOUT} = \mathrm{NOT}(A) \cdot B + \mathrm{NOT}(A) \cdot \mathrm{BIN} + B \cdot \mathrm{BIN} = \mathrm{MAJ}(\mathrm{NOT}(A), B, \mathrm{BIN}). \tag{2.71}$$

Explain very carefully why we need to connect BIN[0] to VSS. Show that for a 
subtracter implemented by inverting the B input of an adder and setting CIN[0] = 
'1', the true overflow for ones' complement or two's complement representations 
is XOR(CIN[MSB], CIN[MSB − 1]). Does this hold for the above subtracter? 

2.32 (Complex CMOS cells) Logic synthesis has completely changed the nature 
of combinational logic design. Synthesis tools like to see a huge selection of cells 
from which to choose in order to optimize speed or area. 

• a. (20 min.) How many AOI nnnn cells are there, if the maximum value of 
n = 4? 

• b. (30 min.) Consider cells of the form AOI nnnn where n can be negative, 
indicating that a set of inputs is inverted. Thus, an AOI-22 (where the hyphen 
indicates the following input is inverted) is a NOR(NOR(A, B), AND(C, 
D)), for example. How many logically different cells of the AOI xxxx 
family are there if x can be '-2', '-1', '1', or '2' with no more than four 
inputs? Remember the AOI family includes OAI, AO, and OA cells as 
well as just AOI. List them using an extension to the notation for a cell 
with mixed-sign inputs: for example, an AO(1-1)1 cell is 
NOT(NOR(AND(A, NOT(B)), C)). Hint: Be very careful because some 
cells with negative inputs are logically equivalent to others with positive 
inputs. 

• c. (10 min.) If we include NAND and NOR cells with inverting inputs in a 
library, how many different cells in the NAND family are there with four 
or fewer inputs (the NAND family includes NOR, AND, and OR cells)? 

• d. (30 min.) How many cells in the AOI and NAND families are there with 
four inputs or fewer that use fewer than eight transistors? Include cells that 
are logically equivalent but have different physical implementations. For 
example, a NAND1-1 cell, requiring six transistors, is logically equivalent 
to an OR1-1 cell that requires eight transistors. The OR1-1 implementation 
may be useful because the output inverter can easily be sized to produce an 
OR1-1 cell with higher drive. 

• e. (**60 min.) How many cells are there with fewer than four inputs that 
do not fit into the AOI or NAND families? Hint: There is an inverter, a 
buffer, a half-adder, and the three-input majority function, for example. 

• f. (***) Recommend a better, user-friendly, naming system (which is also 
CAD tool compatible) for combinational cells. 

2.33 (**Design rules, 60 min.) A typical set of deep submicron CMOS design 
rules is shown in Table 2.16. Design rules are often confusing and use the 
following buzz-words, perhaps to prevent others from understanding them. 

• The end cap is the extension of poly gate beyond the active or diffusion. 

• Overlap. Normally one material is completely contained within the other; 
overlap is then the amount of the surround. 

• Extension refers to the extension of diffusion beyond the poly gate. 

• Same (in a spacing rule) means the space to the same type of diffusion or 
implant. 

• Opposite refers to the space to the opposite type of diffusion or implant. 

• A dogbone is the area surrounding a contact. Often the spacing to a 
dogbone contact is allowed to be slightly less than to an isolated line. 

• Field is the area outside the active regions. The field oxide (sandwiched 
between the diffusion layers and the poly or m1 layers) is thicker than the 
gate oxide and separates transistors. 

• Exact refers to contacts that are all the same size to simplify fabrication. 

• A butting contact consists of two adjacent diffusions of the opposite type 
(connected with metal). This occurs when a well contact is placed next to a 
source contact. 

• Fat metal. Some design rules use different spacing for metal lines that are 
wider than a certain amount. 

• a. Draw a copy of the MOSIS rules as shown in Figure 2.11, but using the 
rule numbers and values in microns and λ from Table 2.16. 

• b. How compatible are the two sets of rules? 

TABLE 2.16 ASIC design rules (Problem 2.33). Absolute values in microns are given for λ = 0.2 µm. 

Layer         Rule (see note 2)                     µm             λ
nwell         N.1  width                            2              10
              N.2  sp. (same)                       1              5
diff          D.1  width                            0.5            2.5
              D.2  transistor width                 0.6            3
              D.3  sp. (same)                       0.6            3
              D.4  sp. (opposite)                   0.8            4
              D.5  p+ (nwell) to n+ (pwell)         2.4            12
              D.6  nwell ov. of n+                  0.6            3
              D.7  nwell sp. to p+                  0.6            3
              D.8  extension over gate              0.6            3
              D.9  nwell ov. of p+                  1.2            6
              D.10 nwell sp. to n+                  1.2            6
poly          P.1  width                            0.4            2
              P.2  gate                             0.4            2
              P.3  sp. (over active)                0.6            3
              P.4  sp. (over field)                 0.5            2.5
              P.5  short sp. (dogbone)              0.45           2.25
              P.6  end cap                          0.45           2.25
              P.7  sp. to diffusion                 0.2            1
implant       I.1  width                            0.6            3
              I.2  sp. (same)                       0.6            3
              I.3  sp. to diff (same)               0.55           2.75
              I.4  sp. to butting diff              0              0
              I.4  ov. of diff                      0.25           1.25
              I.5  sp. to poly on active            0.5            2.5
              I.6  sp. (opposite)                   0.3            1.5
              I.7  sp. to butting implant           0              0
contact       C.1  size (exact)                     0.4            2
              C.2  sp.                              0.6            3
              C.3  poly ov.                         0.3            1.5
              C.4  diff ov. (2 sides/others)        0.25/0.35      1.25/1.75
              C.5  metal ov.                        0.25           1.25
              C.6  sp. to poly                      0.3            1.5
              C.7  poly contact to diff             0.5            2.5
m1, m2/m3     Mn.1 width                            0.6/0.7/1.0    3/3.5/4
              Mn.2 sp. (fat > 25 λ is 5 λ)          0.6/0.7/1.0    3/3.5/4
              Mn.3 sp. (dogbone)                    0.5            2.5
v1, v2/v3     Vn.1 size (exact)                     0.4            2
              Vn.2 sp.                              0.8            4
              Vn.3 metal ov.                        0.25           1.25


2.34 (ESD, 10 min.) 

• a. Explain carefully why a CMOS device can withstand a 2000 V ESD 
event when the gate breakdown voltage is only 5-10 V, but shorting a 
device pin to a 10 V supply can destroy it. 

• b. Explain why an electric shock from a 240 VAC supply can kill you, but 
a 3000 VDC shock from a static charge (walking across a nylon carpet 
and touching a metal doorknob) only gives you a surprise. 

2.35 (*Stacks in CMOS cells, 60 min.) 

• a. Given a CMOS cell of the form AOI ijk or OAI ijk (i, j, k > 0) derive an 
equation for the height (the number of transistors in series) and the width 
(the number of transistors in parallel) of the n-channel and p-channel 
stacks. 

• b. Suppose we increase the number of indices to four, i.e. AOI ijkl. How 
do your equations change? 

• c. If the stack height cannot be greater than three, which three-index AOI 
ijk and OAI ijk cells are illegal? Often limiting the stack height to three or 
four is a design rule for radiation-hard libraries (useful for satellites). 




2.36 (Duals, 20 min.) Draw the n -channel stack (including device sizes, 
assuming a ratio of 2) that complements the p -channel stack shown in 
Figure 2.37. 



FIGURE 2.37 A p -channel stack using a bridge 
device, E (Problem 2.36). 



2.37 (***FPGA conditional-sum adder, days+) A Xilinx application note (M. 
Klein, "Conditional sum adder adds 16 bits in 33 ns," Xilinx Application Brief, 
Xilinx data book, 1992, p. 6-26) describes a 16-bit conditional-sum adder using 
41 CLBs in three stages of addition; see also [Sklansky, 1960]. A Xilinx XC3000 
or XC4000 CLB can perform any logic function of five variables, or two 
functions of (the same) four variables. Can you find a solution with fewer CLBs 
in three stages? Hint: R. P. Halverson of the University of Hawaii produced a 
solution with 36 CLBs. 

2.38 (Encoding, 10 min.) Booth's algorithm was suggested by a shortcut used by 
operators of decimal calculating machines that required turning a handle. To 
multiply 5 by 23 you set the levers to 5 and turned the handle three times, then changed 
gears and turned twice more. 

• a. What is the equivalent of 1 423 4 3 ? 

• b. How many turns do we save using the shortcut? 

2.39 (CSD, 20 min.) 

• a. Show how to convert 1010111 (decimal 87) to the CSD vector 1 0 1̄ 0 1̄ 
0 0 1̄ (where 1̄ denotes a digit of −1). 

• b. Convert 1000101 to the CSD vector. 

• c. How do you know that 1 1̄ 1 0 0 1 1 1̄ (decimal 101) is not the CSD vector 
representation of 1100101 (decimal 101)? 




1. Depths of layers are in angstroms (negative values are etch depths). For 
abbreviations used, see Problem 2.7. 

2. sp. = space; ov. = overlap; same = same diffusion or implant type; opposite = 
opposite implant or diffusion type; diff = p+ or n+; p+ = p+ diffusion; 
n+ = n+ diffusion; implant = p+ or n+ implant select. 
ASIC LIBRARY DESIGN 



Once we have decided to use an ASIC design style (using predefined and 
precharacterized cells from a library) we need to design or buy a cell library. Even 
though it is not necessary, a knowledge of ASIC library design makes it easier to 
use library cells effectively. 




3.1 Transistors as Resistors 



In Section 2.1, CMOS Transistors, we modeled transistors using ideal switches. If this model were 
accurate, logic cells would have no delay. 




FIGURE 3.1 A model for CMOS logic delay. (a) A CMOS inverter with a load capacitance, C_out. 
(b) Input, v(in1), and output, v(out1), waveforms showing the definition of the falling 
propagation delay, t_PDf. In this case delay is measured from the input trip point of 0.5. The output 
trip points are 0.35 (falling) and 0.65 (rising). The model predicts t_PDf ≈ R_pd (C_p + C_out). 
(c) The model for the inverter includes: the input capacitance, C; the pull-up resistance (R_pu) 
and pull-down resistance (R_pd); and the parasitic output capacitance, C_p. 

The ramp input, v(in1), to the inverter in Figure 3.1(a) rises quickly from zero to V_DD. In 
response the output, v(out1), falls from V_DD to zero. In Figure 3.1(b) we measure the 
propagation delay of the inverter, t_PD, using an input trip point of 0.5 and output trip points of 
0.35 (falling, t_PDf) and 0.65 (rising, t_PDr). Initially the n-channel transistor, m1, is off. As the 
input rises, m1 turns on in the saturation region (V_DS > V_GS − V_tn) before entering the linear 
region (V_DS < V_GS − V_tn). We model transistor m1 with a resistor, R_pd (Figure 3.1c); this is 
the pull-down resistance. The equivalent resistance of m2 is the pull-up resistance, R_pu. 

Delay is created by the pull-up and pull-down resistances, R pd and R pu , together with the parasitic 
capacitance at the output of the cell, C p (the intrinsic output capacitance ) and the load capacitance 
(or extrinsic output capacitance ), C out (Figure 3.1 c). If we assume a constant value for R pd , the 
output reaches a lower trip point of 0.35 when (Figure 3.1 b), 

$$0.35\,V_{DD} = V_{DD} \exp\left(\frac{-t_{PDf}}{R_{pd}\,(C_{out} + C_p)}\right). \tag{3.1}$$

An output trip point of 0.35 is convenient because ln(1/0.35) = 1.04 ≈ 1 and thus 

$$t_{PDf} = R_{pd}\,(C_{out} + C_p)\,\ln(1/0.35) \approx R_{pd}\,(C_{out} + C_p). \tag{3.2}$$

The expression for the rising delay (with a 0.65 output trip point) is identical in form. Delay thus 
increases linearly with the load capacitance. We often measure load capacitance in terms of a 
standard load: the input capacitance presented by a particular cell (often an inverter or two-input 
NAND cell). 

We may adjust the delay for different trip points. For example, for output trip points of 0.1/0.9 we 
multiply Eq. 3.2 by −ln(0.1) = 2.3, because exp(−2.3) = 0.100. 
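
As a rough illustration of the trip-point adjustment, the following Python sketch evaluates the multiplier ln(1/trip) for a simple exponential RC discharge. The function name `rc_delay` and the example values of 1 kΩ and 1 pF are illustrative assumptions, not values from the text.

```python
import math

def rc_delay(r_ohm, c_farad, trip_fraction):
    """Time for V_DD * exp(-t / RC) to fall to trip_fraction * V_DD.

    Hypothetical helper: assumes an ideal step input and a constant resistance.
    """
    return r_ohm * c_farad * math.log(1.0 / trip_fraction)

# Multipliers applied to the RC product for two choices of output trip point.
factor_035 = math.log(1.0 / 0.35)  # close to 1, which is why 0.35 is convenient
factor_010 = math.log(1.0 / 0.10)  # about 2.3, as quoted for 0.1/0.9 trip points

# Example with assumed values R = 1 kOhm and C = 1 pF (RC = 1 ns).
delay_035 = rc_delay(1000.0, 1e-12, 0.35)
delay_010 = rc_delay(1000.0, 1e-12, 0.10)
```

The ratio `delay_010 / delay_035` is the factor by which a delay quoted at the 0.35 trip point would be rescaled to a 0.1 trip point under this idealized model.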



Figure 3.2 shows the DC characteristics of a CMOS inverter. To form Figure 3.2(b) we take the 
n-channel transistor surface (Figure 2.4b) and add that for a p-channel transistor (rotated to account 
for the connections). Seen from above, the intersection of the two surfaces is the static transfer 
curve of Figure 3.2(a); along this path the transistor currents are equal and there is no output 
current to change the output voltage. Seen from one side, the intersection is the curve of 
Figure 3.2(c). 




FIGURE 3.2 CMOS inverter characteristics. (a) This static inverter transfer curve is traced as 
the inverter switches slowly enough to be in equilibrium at all times (I_DSn = −I_DSp). (b) This 
surface corresponds to the current flowing in the n-channel transistor (falling delay) and 
p-channel transistor (rising delay) for any trajectory. (c) The current that flows through both 
transistors as the inverter switches along the equilibrium path. 




The input waveform, v(in1), and the output load (which determines the transistor currents) dictate 
the path we take on the surface of Figure 3.2(b) as the inverter switches. We can thus see that the 
currents through the transistors (and thus the pull-up and pull-down resistance values) will vary in a 
nonlinear way during switching. Deriving theoretical values for the pull-up and pull-down 
resistance values is difficult; instead we work the problem backward by picking the trip points, 
simulating the propagation delays, and then calculating resistance values that fit the model. 









FIGURE 3.3 Delay. (a) LogicWorks schematic for inverters driving 1, 2, 4, and 8 standard loads 
(1 standard load = 0.034 pF in this case). (b) Transient response (falling delay only) from PSpice. 
The postprocessor Probe was used to mark each waveform as it crosses its trip point (0.5 for the 
input, 0.35 for the outputs). For example v(out1_4) (4 standard loads) crosses 1.0467 V (≈ 0.35 
V_DD) at t = 169.93 ps. (c) Falling and rising delays as a function of load. The slopes in ps pF^-1 
correspond to the pull-up resistance (1281 Ω) and pull-down resistance (817 Ω). 
(d) Comparison of the delay model (valid for t > 20 ps) and simulation (4 standard loads). Both are 
equal at the 0.35 trip point. 



Figure 3.3 shows a simulation experiment (using the G5 process SPICE parameters from 
Table 2.1). From the results in Figure 3.3(c) we can see that R_pd = 817 Ω and R_pu = 1281 Ω for 
this inverter (with shape factors of 6/0.6 for the n-channel transistor and 12/0.6 for the p-channel) 
using 0.5 (input) and 0.35/0.65 (output) trip points. Changing the trip points would give different 
resistance values. 

We can check that 817 Ω is a reasonable value for the pull-down resistance. In the saturation 
region I_DS(sat) is (to first order) independent of V_DS. For an n-channel transistor from our 
generic 0.5 µm process (G5 from Section 2.1) with shape factor W/L = 6/0.6, I_DSn(sat) = 2.5 mA 
(at V_GS = 3 V and V_DS = 3 V). The pull-down resistance, R_1, that would give the same 
drain-source current is 

$$R_1 = 3.0\ \mathrm{V} / (2.5 \times 10^{-3}\ \mathrm{A}) = 1200\ \Omega. \tag{3.3}$$



This value is greater than, but not too different from, our measured pull-down resistance of 817 Ω. 
We might expect this result since Figure 3.2(b) shows that the pull-down resistance reaches its 
maximum value at V_GS = 3 V, V_DS = 3 V. We could adjust the ratio of the logic so that the rising 
and falling delays were equal; then R = R_pd = R_pu is the pull resistance. 

Next, we check our model against the simulation results. The model predicts 

$$v(\mathrm{out1}) \approx V_{DD} \exp\left(\frac{-t'}{R_{pd}\,(C_{out} + C_p)}\right) \quad \text{for } t' > 0. \tag{3.4}$$

(t' is measured from the point at which the input crosses the 0.5 trip point, t' = 0 at t = 20 ps). With 
C_out = 4 standard loads = 4 × 0.034 pF = 0.136 pF, 

$$R_{pd}\,(C_{out} + C_p) = (38 + 817\,(0.136))\ \mathrm{ps} = 149.112\ \mathrm{ps}. \tag{3.5}$$

To make a comparison with the simulation we need to use ln(1/0.35) = 1.04 and not approximately 
1 as we have assumed, so that (with all times in ps) 

$$v(\mathrm{out1}) \approx 3.0 \exp\left(\frac{-t'}{149.112/1.04}\right) \mathrm{V} = 3.0 \exp\left(\frac{-(t - 20)}{143.4}\right) \mathrm{V} \quad \text{for } t > 20\ \mathrm{ps}. \tag{3.6}$$

Equation 3.6 is plotted in Figure 3.3(d). For v(out1) = 1.05 V (equal to the 0.35 output trip point), 
Eq. 3.6 predicts t = 20 + 149.112 ≈ 169 ps and agrees with Figure 3.3(b); it should because we 
derived the model from these results! 
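
The comparison with Eq. 3.6 can be reproduced numerically. The sketch below is a minimal check, assuming the constants quoted in the text (V_DD = 3.0 V, t = 20 ps at the input trip point, and a time constant of 149.112/1.04 ps); the function names are hypothetical.

```python
import math

V_DD = 3.0              # supply voltage, V
T0_PS = 20.0            # time at which the input crosses its 0.5 trip point, ps
TAU_PS = 149.112 / 1.04  # time constant used in Eq. 3.6, about 143.4 ps

def v_out1(t_ps):
    """Output voltage predicted by the exponential model of Eq. 3.6 (volts)."""
    if t_ps <= T0_PS:
        return V_DD  # the model is only valid for t > 20 ps
    return V_DD * math.exp(-(t_ps - T0_PS) / TAU_PS)

def crossing_time_ps(v_target):
    """Time at which the modeled output falls to v_target (ps)."""
    return T0_PS + TAU_PS * math.log(V_DD / v_target)

# Model voltage at the time Probe reported the 0.35 trip-point crossing.
v_at_probe_time = v_out1(169.93)

# Model time for the output to reach 1.05 V (0.35 * V_DD).
t_at_trip = crossing_time_ps(1.05)
```

Under these assumptions the model voltage at t = 169.93 ps lies within a few millivolts of 1.05 V, and the modeled crossing time is within about a picosecond of the simulated value.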

Now we find C_p. From Figure 3.3(c) and Eq. 3.2 

$$t_{PDr} = (52 + 1281\,C_{out})\ \mathrm{ps}, \text{ thus } C_{pr} = 52/1281 = 0.041\ \mathrm{pF} \text{ (rising)},$$

$$t_{PDf} = (38 + 817\,C_{out})\ \mathrm{ps}, \text{ thus } C_{pf} = 38/817 = 0.047\ \mathrm{pF} \text{ (falling)}. \tag{3.7}$$

These intrinsic parasitic capacitance values depend on the choice of output trip points, even though 
C_pf R_pd and C_pr R_pu are constant for a given input trip point and waveform, because the pull-up 
and pull-down resistances depend on the choice of output trip points. We take a closer look at 
parasitic capacitance next. 
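
The extraction of resistance and intrinsic capacitance from a fitted delay line can be sketched in a few lines of Python. This assumes a fit of the form t_PD = intercept + slope × C_out with delay in ps and load in pF, as in Eq. 3.7; the function name `fit_to_rc` is a hypothetical helper.

```python
def fit_to_rc(intercept_ps, slope_ps_per_pf):
    """Convert a linear delay fit into (R in ohms, C_p in pF).

    Since 1 ps / 1 pF = 1 ohm, the slope is the pull resistance directly,
    and the intercept R * C_p divided by R gives the intrinsic capacitance.
    """
    r_ohm = slope_ps_per_pf
    c_p_pf = intercept_ps / slope_ps_per_pf
    return r_ohm, c_p_pf

# Rising and falling fits quoted in Eq. 3.7.
r_pu, c_pr = fit_to_rc(52.0, 1281.0)
r_pd, c_pf = fit_to_rc(38.0, 817.0)
```

The products `r_pu * c_pr` and `r_pd * c_pf` recover the intercepts (52 ps and 38 ps), which is the quantity that stays fixed when the trip points change the individual R and C_p values.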



3.2 Transistor Parasitic Capacitance 

Logic-cell delay results from transistor resistance, transistor (intrinsic) parasitic 
capacitance, and load (extrinsic) capacitance. When one logic cell drives another, the 
parasitic input capacitance of the driven cell becomes the load capacitance of the driving 
cell and this will determine the delay of the driving cell. 

Figure 3.4 shows the components of transistor parasitic capacitance. SPICE prints all of 
the MOS parameter values for each transistor at the DC operating point. The following 
values were printed by PSpice (v5.4) for the simulation of Figure 3.3 : 




FIGURE 3.4 Transistor parasitic capacitance. (a) An n-channel MOS transistor with 
(drawn) gate length L and width W. (b) The gate capacitance is split into: the constant 
overlap capacitances C_GSOV, C_GDOV, and C_GBOV and the variable capacitances C_GS, 
C_GB, and C_GD, which depend on the operating region. (c) A view showing how the 
different capacitances are approximated by planar components (T_FOX is the field-oxide 
thickness). (d) C_BS and C_BD are the sum of the area (C_BSJ, C_BDJ), sidewall (C_BSSW, 
C_BDSW), and channel edge (C_BSJGATE, C_BDJGATE) capacitances. (e)-(f) The 
dimensions of the gate, overlap, and sidewall capacitances (L_D is the lateral diffusion). 

NAME m1 m2 
MODEL CMOSN CMOSP 
ID 7.49E-11 -7.49E-11 
VGS 0.00E+00 -3.00E+00 
VDS 3.00E+00 -4.40E-08 
VBS 0.00E+00 0.00E+00 
VTH 4.14E-01 -8.96E-01 
VDSAT 3.51E-02 -1.78E+00 
GM 1.75E-09 2.52E-11 
GDS 1.24E-10 1.72E-03 
GMB 6.02E-10 7.02E-12 
CBD 2.06E-15 1.71E-14 
CBS 4.45E-15 1.71E-14 
CGSOV 1.80E-15 2.88E-15 
CGDOV 1.80E-15 2.88E-15 
CGBOV 2.00E-16 2.01E-16 
CGS 0.00E+00 1.10E-14 
CGD 0.00E+00 1.10E-14 
CGB 3.88E-15 0.00E+00 

The parameters ID (I_DS), VGS, VDS, VBS, VTH (V_t), and VDSAT (V_DS(sat)) are 
DC parameters. The parameters GM, GDS, and GMB are small-signal conductances 
(corresponding to ∂I_DS/∂V_GS, ∂I_DS/∂V_DS, and ∂I_DS/∂V_BS, respectively). The 
remaining parameters are the parasitic capacitances. Table 3.1 shows the calculation of 
these capacitance values for the n-channel transistor m1 (with W = 6 µm and L = 0.6 µm) 
in Figure 3.3(a). 

TABLE 3.1 Calculations of parasitic capacitances for an n-channel MOS transistor. 

PSpice   Equation                                             Values (see note 1) for VGS = 0 V, VDS = 3 V, VSB = 0 V
CBD      C_BD = C_BDJ + C_BDSW                                C_BD = 1.855 × 10^-15 + 2.04 × 10^-16 = 2.06 × 10^-15 F
         C_BDJ = A_D C_J (1 + V_DB/φ_B)^(-mJ)  (φ_B = PB)     C_BDJ = (4.032 × 10^-15)(1 + (3/1))^(-0.56) = 1.86 × 10^-15 F
         C_BDSW = P_D C_JSW (1 + V_DB/φ_B)^(-mJSW)            C_BDSW = (4.2 × 10^-16)(1 + (3/1))^(-0.52) = 2.04 × 10^-16 F
           (P_D may or may not include channel edge)
CBS      C_BS = C_BSJ + C_BSSW                                C_BS = 4.032 × 10^-15 + 4.2 × 10^-16 = 4.45 × 10^-15 F
         C_BSJ = A_S C_J (1 + V_SB/φ_B)^(-mJ)                 A_S C_J = (7.2 × 10^-12)(5.6 × 10^-4) = 4.03 × 10^-15 F
         C_BSSW = P_S C_JSW (1 + V_SB/φ_B)^(-mJSW)            P_S C_JSW = (8.4 × 10^-6)(5 × 10^-11) = 4.2 × 10^-16 F
CGSOV    C_GSOV = W_EFF C_GSO ; W_EFF = W − 2 W_D             C_GSOV = (6 × 10^-6)(3 × 10^-10) = 1.8 × 10^-15 F
CGDOV    C_GDOV = W_EFF C_GDO                                 C_GDOV = (6 × 10^-6)(3 × 10^-10) = 1.8 × 10^-15 F
CGBOV    C_GBOV = L_EFF C_GBO ; L_EFF = L − 2 L_D             C_GBOV = (0.5 × 10^-6)(4 × 10^-10) = 2 × 10^-16 F
CGS      C_GS/C_O = 0 (off), 0.5 (lin.), 0.66 (sat.)          C_O = (6 × 10^-6)(0.5 × 10^-6)(0.00345) = 1.03 × 10^-14 F
         C_O (oxide capacitance) = W_EFF L_EFF ε_ox / T_ox    C_GS = 0.0 F
CGD      C_GD/C_O = 0 (off), 0.5 (lin.), 0 (sat.)             C_GD = 0.0 F
CGB      C_GB = 0 (on), = C_O in series with C_S (off)        C_GB = 3.88 × 10^-15 F, C_S = depletion capacitance

1. Input: 
.MODEL CMOSN NMOS LEVEL=3 PHI=0.7 TOX=10E-09 XJ=0.2U TPG=1 VTO=0.65 DELTA=0.7 
+ LD=5E-08 KP=2E-04 UO=550 THETA=0.27 RSH=2 GAMMA=0.6 NSUB=1.4E+17 NFS=6E+11 
+ VMAX=2E+05 ETA=3.7E-02 KAPPA=2.9E-02 CGDO=3.0E-10 CGSO=3.0E-10 CGBO=4.0E-10 
+ CJ=5.6E-04 MJ=0.56 CJSW=5E-11 MJSW=0.52 PB=1 
m1 out1 in1 0 0 cmosn W=6U L=0.6U AS=7.2P AD=7.2P PS=8.4U PD=8.4U 

3.2.1 Junction Capacitance 

The junction capacitances, C_BD and C_BS, consist of two parts: junction area and 
sidewall. Each has different physical characteristics, with parameters CJ and MJ for the 
junction area, CJSW and MJSW for the sidewall, and PB common to both. These capacitances 
depend on the voltage across the junction (V_DB and V_SB). The calculations in Table 3.1 
assume both source and drain regions are 6 µm × 1.2 µm rectangles, so that A_D = 
A_S = 7.2 µm^2, and the perimeters (excluding the 6 µm channel edge) are P_D = 
P_S = 6 + 1.2 + 1.2 = 8.4 µm. We exclude the channel edge because the sidewalls facing 
the channel (corresponding to C_BSJGATE and C_BDJGATE in Figure 3.4) are different 
from the sidewalls that face the field. There is no standard method to allow for this. It is a 
mistake to exclude the gate edge assuming it is accounted for in the rest of the model; it is 
not. A pessimistic simulation includes the channel edge in P_D and P_S (but a true 
worst-case analysis would use more accurate models and worst-case model parameters). 
In HSPICE there is a separate mechanism to account for the channel edge capacitance 
(using parameters ACM and CJGATE). In Table 3.1 we have neglected C_JGATE. 
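
The junction-capacitance rows of Table 3.1 can be reproduced with a short Python sketch. It assumes the LEVEL=3 model values quoted in the text (CJ, MJ, CJSW, MJSW, PB) and the area and perimeter of the 6 µm × 1.2 µm diffusion; the function name `junction_cap` is hypothetical, and the channel-edge (gate) term is neglected as in the table.

```python
def junction_cap(area_m2, perim_m, v_reverse,
                 cj=5.6e-4, mj=0.56, cjsw=5e-11, mjsw=0.52, pb=1.0):
    """Return (area part, sidewall part, total) junction capacitance in farads.

    v_reverse is the reverse bias across the junction (V_DB or V_SB) in volts.
    Default parameters are the CMOSN model values quoted in the text.
    """
    area_part = area_m2 * cj * (1.0 + v_reverse / pb) ** (-mj)
    sidewall_part = perim_m * cjsw * (1.0 + v_reverse / pb) ** (-mjsw)
    return area_part, sidewall_part, area_part + sidewall_part

AREA = 7.2e-12   # A_D = A_S = 7.2 square microns, in m^2
PERIM = 8.4e-6   # P_D = P_S = 8.4 microns, in m

# Drain junction with V_DB = 3 V, and source junction with V_SB = 0 V.
c_bdj, c_bdsw, c_bd = junction_cap(AREA, PERIM, 3.0)
c_bsj, c_bssw, c_bs = junction_cap(AREA, PERIM, 0.0)
```

With these inputs the totals come out close to the CBD and CBS values that PSpice printed for m1 (2.06E-15 and 4.45E-15). Adding the channel edge to the perimeter would give the more pessimistic estimate that the text mentions.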



For the p-channel transistor m2 (W = 12 µm and L = 0.6 µm) the source and drain 
regions are 12 µm × 1.2 µm rectangles, so that A_D = A_S ≈ 14 µm^2, and the 
perimeters are P_D = P_S = 12 + 1.2 + 1.2 ≈ 14 µm (these parameters are rounded to two 
significant figures solely to simplify the figures and tables). 

In passing, notice that a 1.2 µm strip of diffusion in a 0.6 µm process (λ = 0.3 µm) is 
only 4 λ wide, wide enough to place a contact only with aggressive spacing rules. The 
conservative rules in Figure 2.11 would require a diffusion width of at least 2 (rule 6.4a) 
+ 2 (rule 6.3a) + 1.5 (rule 6.2a) = 5.5 λ. 

3.2.2 Overlap Capacitance 

The overlap capacitance calculations for C_GSOV and C_GDOV in Table 3.1 account for 
lateral diffusion (the amount the source and drain extend under the gate) using SPICE 
parameter LD = 5E-08 or L_D = 0.05 µm. Not all versions of SPICE use the equivalent 
parameter for width reduction, WD (assumed zero in Table 3.1), in calculating C_GDOV 
and not all versions subtract W_D to form W_EFF. 
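
A minimal Python sketch of the overlap-capacitance rows of Table 3.1 follows. It assumes the effective dimensions W_EFF = W − 2 W_D and L_EFF = L − 2 L_D given in the table, with W_D = 0, and the CGSO, CGDO, and CGBO values from the CMOSN model; the function name is hypothetical.

```python
def overlap_caps(w_m, l_m, ld=5e-8, wd=0.0,
                 cgso=3.0e-10, cgdo=3.0e-10, cgbo=4.0e-10):
    """Return (C_GSOV, C_GDOV, C_GBOV) in farads for drawn W and L in meters.

    The per-unit-length parameters (F/m) default to the CMOSN model values.
    """
    w_eff = w_m - 2.0 * wd   # width reduction, assumed zero in Table 3.1
    l_eff = l_m - 2.0 * ld   # lateral diffusion shortens the channel
    return w_eff * cgso, w_eff * cgdo, l_eff * cgbo

# n-channel transistor m1: W = 6 microns, L = 0.6 microns.
c_gsov, c_gdov, c_gbov = overlap_caps(6e-6, 0.6e-6)
```

Passing a nonzero `wd` shows how much a simulator that does subtract W_D would change the gate-source and gate-drain overlap values.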

3.2.3 Gate Capacitance 

The gate capacitance calculations in Table 3.1 depend on the operating region. The 
gate-source capacitance C_GS varies from zero when the transistor is off to 0.5 C_O (0.5 × 
1.035 × 10^-14 = 5.18 × 10^-15 F) in the linear region to (2/3) C_O in the saturation region 
(6.9 × 10^-15 F). The gate-drain capacitance C_GD varies from zero (off) to 0.5 C_O (linear 
region) and back to zero (saturation region). 

The gate-bulk capacitance C_GB may be viewed as two capacitors in series: the fixed 
gate-oxide capacitance, C_O = W_EFF L_EFF ε_ox / T_ox, and the variable depletion 
capacitance, C_S = W_EFF L_EFF ε_Si / x_d, formed by the depletion region that extends 
under the gate (with varying depth x_d). As the transistor turns on the conducting channel 
appears and shields the bulk from the gate, and at this point C_GB falls to zero. Even with 
V_GS = 0 V, the depletion width under the gate is finite and thus C_GB ≈ 4 × 10^-15 F is less 
than C_O ≈ 10^-14 F. In fact, since C_GB ≈ 0.5 C_O, we can tell that at V_GS = 0 V, C_S ≈ C_O. 
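
The gate-capacitance figures above can be checked with a short sketch. It assumes ε_ox = 3.45 × 10^-11 F/m (so that ε_ox / T_ox = 0.00345 F/m^2 for T_ox = 10 nm, the value used in Table 3.1) and the region fractions listed in the table; all function names are hypothetical.

```python
EPS_OX = 3.45e-11  # permittivity of SiO2 in F/m (assumed: 3.9 * eps_0)

# Fractions of C_O seen as (C_GS, C_GD) in each operating region.
GATE_FRACTIONS = {
    "off": (0.0, 0.0),
    "linear": (0.5, 0.5),
    "saturation": (2.0 / 3.0, 0.0),
}

def oxide_cap(w_eff_m, l_eff_m, t_ox_m):
    """Gate-oxide capacitance C_O = W_EFF * L_EFF * eps_ox / T_ox (farads)."""
    return w_eff_m * l_eff_m * EPS_OX / t_ox_m

def gate_caps(c_o, region):
    """Return (C_GS, C_GD) for the given operating region."""
    f_gs, f_gd = GATE_FRACTIONS[region]
    return f_gs * c_o, f_gd * c_o

def series_partner(c_series, c_known):
    """Capacitance that, in series with c_known, gives c_series."""
    return c_series * c_known / (c_known - c_series)

c_o = oxide_cap(6e-6, 0.5e-6, 10e-9)
c_gs_lin, c_gd_lin = gate_caps(c_o, "linear")
c_gs_sat, c_gd_sat = gate_caps(c_o, "saturation")

# Depletion capacitance implied by the PSpice value CGB = 3.88E-15 at V_GS = 0 V.
c_s = series_partner(3.88e-15, c_o)
```

Using the unrounded CGB value, the implied depletion capacitance is roughly 0.6 C_O, which is the same order as C_O and consistent with the rough estimate in the text.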



Figure 3.5 shows the variation of the parasitic capacitance values. 
FIGURE 3.5 The variation of n-channel transistor parasitic capacitance. Values were 
obtained from a series of DC simulations using PSpice v5.4, the parameters shown in 
Table 3.1 (LEVEL=3), and by varying the input voltage, v(in1), of the inverter in 
Figure 3.3(a). Data points are joined by straight lines. Note that CGSOV = CGDOV. 



3.2.4 Input Slew Rate 



Figure 3.6 shows an experiment to monitor the input capacitance of an inverter as it 
switches. We have introduced another variable: the delay of the input ramp or the slew 
rate of the input. 

In Figure 3.6(b) the input ramp is 40 ps long with a slew rate of 3 V / 40 ps or 75 GV s^-1, 
as in our previous experiments, and the output of the inverter hardly moves before the 
input has changed. The input capacitance varies from 20 to 40 fF with an average value 
of approximately 34 fF for both transitions; we can measure the average value in Probe by 
plotting AVG(-i(Vin)). 
FIGURE 3.6 The input capacitance of an inverter. (a) Input capacitance is measured by 
monitoring the input current to the inverter, i(Vin). (b) Very fast switching. The current, 
i(Vin), is multiplied by the input ramp delay (Δt = 0.04 ns) and divided by the voltage 
swing (ΔV = V_DD = 3 V) to give the equivalent input capacitance, C = i Δt / ΔV. 
Thus an adjusted input current of 40 fA corresponds to an input capacitance of 40 fF. 
The current, i(Vin), is positive for the rising edge of the input and negative for the 
falling edge. (c) Very slow switching. The input capacitance is now equal for both 
transitions. 
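
The conversion from input current to equivalent input capacitance, C = i Δt / ΔV, can be sketched as follows. The ramp parameters are those quoted for Figure 3.6(b); the 2.55 mA example current is an assumed value chosen to correspond to the 34 fF average, not a number read from the figure.

```python
def slew_rate(delta_v, delta_t):
    """Slew rate of a linear ramp in volts per second."""
    return delta_v / delta_t

def equivalent_input_cap(i_amp, delta_t, delta_v):
    """Equivalent input capacitance C = i * dt / dV in farads."""
    return i_amp * delta_t / delta_v

DELTA_V = 3.0      # voltage swing, V (equal to V_DD)
DELTA_T = 40e-12   # fast input ramp duration, s

fast_slew = slew_rate(DELTA_V, DELTA_T)  # 75 GV/s

# Assumed average input current that corresponds to about 34 fF.
c_in = equivalent_input_cap(2.55e-3, DELTA_T, DELTA_V)
```

Repeating the calculation with the 300 ns ramp scales the current down by the ratio of the ramp times for the same capacitance, which is why the current is rescaled before the two cases are compared.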

In Figure 3.6(c) the input ramp is slow enough (300 ns) that we are switching under 
almost equilibrium conditions: at each voltage we allow the output to find its level on the 
static transfer curve of Figure 3.2(a). The switching waveforms are quite different. The 
average input capacitance is now approximately 0.04 pF (a 20 percent difference). The 
propagation delay (using an input trip point of 0.5 and an output trip point of 0.35) is 
negative, with a magnitude of approximately 150 − 127 = 23 ns. By changing the input slew 
rate we have broken our model. For the moment we shall ignore this problem and proceed. 

The calculations in Table 3.1 and the behavior of Figures 3.5 and 3.6 are very complex. 
How can we find the value of the parasitic capacitance, C, to fit the model of Figure 3.1? 
Once again, as we did for pull resistance and the intrinsic output capacitance, instead of 
trying to derive a theoretical value for C, we adjust the value to fit the model. Before we 
formulate another experiment we should bear in mind the following questions that the 
experiment of Figure 3.6 raises: Is it valid to replace the nonlinear input capacitance with 
a linear component? Is it valid to use a linear input ramp when the normal waveforms are 
so nonlinear? 

Figure 3.7 shows an experiment crafted to answer these questions. The experiment has 
the following two steps: 

1. Adjust c2 to model the input capacitance of m5/6 ; then C = c2 = 0.0335 pF. 

2. Remove all the parasitic capacitances for inverter m9/10 except for the gate 
capacitances C GS , C GD , and C GB and then adjust c3 (0.01 pF) and c4 (0.025 
pF) to model the effect of these missing parasitics. 
FIGURE 3.7 Parasitic capacitance. (a) All devices in this circuit include parasitic 
capacitance. (b) This circuit uses linear capacitors to model the parasitic 
capacitance of m9/10. The load formed by the inverter (m5 and m6) is modeled 
by a 0.0335 pF capacitor (c2); the parasitic capacitance due to the overlap of the 
gates of m3 and m4 with their source, drain, and bulk terminals is modeled by a 
0.01 pF capacitor (c3); and the effect of the parasitic capacitance at the drain 
terminals of m3 and m4 is modeled by a 0.025 pF capacitor (c4). (c) The two 
circuits compared. The delay shown (1.22 − 1.135 = 0.085 ns) is equal to t_PDf for 
the inverter m3/4. (d) An exact match would have both waveforms equal at the 
0.35 trip point (1.05 V). 



We can summarize our findings from this and previous experiments as follows: 

1. Since the waveforms in Figure 3.7 match, we can model the input capacitance of a 
logic cell with a linear capacitor. However, we know the input capacitance may 
vary (by up to 20 percent in our example) with the input slew rate. 

2. The input waveform to the inverter m3/m4 in Figure 3.7 is from another inverter, 
not a linear ramp. The difference in slew rate causes an error. The measured delay 
is 85 ps (0.085 ns), whereas our model (Eq. 3.7) predicts 

$$t_{PDf} = (38 + 817\,C_{out})\ \mathrm{ps} = (38 + (817)(0.0335))\ \mathrm{ps} = 65\ \mathrm{ps}. \tag{3.8}$$

3. The total gate-oxide capacitance in our inverter with T_ox = 100 Å is 

$$C_O = (W_n L_n + W_p L_p)\,\epsilon_{ox} / T_{ox} = (34.5 \times 10^{-4})\,((6)(0.6) + (12)(0.6))\ \mathrm{pF} = 0.037\ \mathrm{pF}. \tag{3.9}$$

4. All the transistor parasitic capacitances excluding the gate capacitance contribute 
0.01 pF of the 0.0335 pF input capacitance, about 30 percent. The gate capacitances 
contribute the rest: 0.025 pF (about 70 percent). 



The last two observations are useful. Since the gate capacitances are nonlinear, we only 
see about 0.025/0.037 or 70 percent of the 0.037 pF gate-oxide capacitance, C_O, in the 
input capacitance, C. This means that it happens by chance that the total gate-oxide 
capacitance is also a rough estimate of the gate input capacitance, C ≈ C_O. Using L and 
W rather than L_EFF and W_EFF in Eq. 3.9 helps this estimate. The accuracy of this 
estimate depends on the fact that the junction capacitances are approximately one-third of 
the gate-oxide capacitance, which happens to be true for many CMOS processes for the 
shapes of transistors that normally occur in logic cells. In the next section we shall use 
this estimate to help us design logic cells. 
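
The arithmetic behind Eqs. 3.8 and 3.9 and the 70 percent estimate can be reproduced with the sketch below. It assumes the falling-delay fit from Eq. 3.7 and an oxide capacitance per unit area of 34.5 × 10^-4 pF per square micron (T_ox = 100 Å); the function names are hypothetical.

```python
def falling_delay_ps(c_out_pf):
    """Falling delay from the fitted model of Eq. 3.7: (38 + 817 * C_out) ps."""
    return 38.0 + 817.0 * c_out_pf

def gate_oxide_cap_pf(transistors, cox_pf_per_um2=34.5e-4):
    """Total gate-oxide capacitance in pF for (W, L) pairs given in microns."""
    return cox_pf_per_um2 * sum(w * l for w, l in transistors)

# Eq. 3.8: model delay for the 0.0335 pF inverter load.
t_model = falling_delay_ps(0.0335)

# Eq. 3.9: n-channel 6/0.6 and p-channel 12/0.6, using drawn dimensions.
c_o = gate_oxide_cap_pf([(6.0, 0.6), (12.0, 0.6)])

# Fraction of the gate-oxide capacitance seen at the input (gate part only).
gate_fraction = 0.025 / c_o
```

The model value of about 65 ps falls short of the measured 85 ps, which is the slew-rate error noted in the second finding.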



3.3 Logical Effort 

In this section we explore a delay model based on logical effort, a term coined by 
Ivan Sutherland and Robert Sproull [1991], that has as its basis the time-constant 
analysis of Carver Mead, Chuck Seitz, and others. 

We add a catch-all nonideal component of delay, $t_q$, to Eq. 3.2 that includes: 
(1) delay due to internal parasitic capacitance; (2) the time for the input to reach 
the switching threshold of the cell; and (3) the dependence of the delay on the 
slew rate of the input waveform. With these assumptions we can express the 
delay as follows: 

$t_{PD} = R (C_{out} + C_p) + t_q$ . (3.10) 

(The input capacitance of the logic cell is C , but we do not need it yet.) 

We will use a standard-cell library for a 3.3 V, 0.5 µm (0.6 µm drawn) 
technology (from Compass) to illustrate our model. We call this technology C5; 
it is almost identical to the G5 process from Section 2.1 (the Compass library 
uses a more accurate and more complicated SPICE model than the generic 
process). The equation for the delay of a 1X drive, two-input NAND cell is in the 
form of Eq. 3.10 ($C_{out}$ is in pF): 

$t_{PD} = (0.07 + 1.46 C_{out} + 0.15)$ ns . (3.11) 

The delay due to the intrinsic output capacitance (0.07 ns, equal to $RC_p$) and the 
nonideal delay ($t_q$ = 0.15 ns) are specified separately. The nonideal delay is a 
considerable fraction of the total delay, so we can hardly ignore it. If data books 
do not specify these components of delay separately, we have to estimate the 
fractions of the constant part of a delay equation to assign to $RC_p$ and $t_q$ (here 
the ratio $t_q / RC_p$ is approximately 2). 

The data book tells us the input trip point is 0.5 and the output trip points are 0.35 
and 0.65. We can use Eq. 3.11 to estimate the pull resistance for this cell as 
$R \approx 1.46$ ns pF$^{-1}$, or about 1.5 kΩ. Equation 3.11 is for the falling delay; the data 
book equation for the rising delay gives slightly different values (but within 10 
percent of the falling delay values). 

We can scale any logic cell by a scaling factor $s$ (transistor gates become $s$ times 
wider, but the gate lengths stay the same), and as a result the pull resistance $R$ 
will decrease to $R/s$ and the parasitic capacitance $C_p$ will increase to $sC_p$. 
Since $t_q$ is nonideal, by definition it is hard to predict how it will scale. We shall 
assume that $t_q$ scales linearly with $s$ for all cells. The total cell delay then scales 
as follows: 

$t_{PD} = (R/s)(C_{out} + sC_p) + st_q$ . (3.12) 

For example, the delay equation for a 2X drive ($s$ = 2), two-input NAND cell is 

$t_{PD} = (0.03 + 0.75 C_{out} + 0.51)$ ns . (3.13) 

Compared to the 1X version (Eq. 3.11), the output parasitic delay has decreased 
to 0.03 ns (from 0.07 ns), whereas we predicted it would remain constant (the 
difference is because of the layout); the pull resistance has decreased by a factor 
of 2 from 1.5 kΩ to 0.75 kΩ, as we would expect; and the nonideal delay has 
increased to 0.51 ns (from 0.15 ns). The differences between our predictions and 
the actual values give us a measure of the model accuracy. 
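
The comparison above can be tabulated with a short sketch. It applies the scaling rule of Eq. 3.12 to the 1X constants of Eq. 3.11 and sets the result beside the data-book equation for the 2X cell (Eq. 3.13). The function names are hypothetical; the constants come from the text.

```python
def nand2_delay_ns(c_out_pf, s=1.0, r_kohm=1.46, rcp_ns=0.07, tq_ns=0.15):
    """Eq. 3.12 using the 1X NAND2 constants of Eq. 3.11 (kOhm x pF = ns)."""
    cp_pf = rcp_ns / r_kohm                     # C_p of the 1X cell, from R*C_p
    return (r_kohm / s) * (c_out_pf + s * cp_pf) + s * tq_ns


def databook_2x_ns(c_out_pf):
    """Eq. 3.13, the data-book delay equation for the 2X NAND2 cell."""
    return 0.03 + 0.75 * c_out_pf + 0.51


for c_out in (0.05, 0.1, 0.3):
    model = nand2_delay_ns(c_out, s=2)
    book = databook_2x_ns(c_out)
    print(f"C_out = {c_out} pF: scaled model {model:.3f} ns, data book {book:.3f} ns")
```

The gap between the two columns is dominated by the constant terms, which is where the scaling assumptions for the parasitic and nonideal delay enter.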

We rewrite Eq. 3.12 using the input capacitance of the scaled logic cell, $C_{in} = sC$, 

$t_{PD} = RC \left( \frac{C_{out}}{C_{in}} \right) + RC_p + st_q$ . (3.14) 

Finally we normalize the delay using the time constant formed from the pull 
resistance $R_{inv}$ and the input capacitance $C_{inv}$ of a minimum-size inverter: 

$d = \frac{(RC)(C_{out}/C_{in}) + RC_p + st_q}{\tau} = f + p + q$ . (3.15) 

The time constant tau, 

$\tau = R_{inv} C_{inv}$ , (3.16) 

is a basic property of any CMOS technology. We shall measure delays in terms 
of $\tau$. 

The delay equation for a 1X (minimum-size) inverter in the C5 library is 

$t_{PDf} = R_{pd} (C_{out} + C_p) \ln(1/0.35) \approx R_{pd} (C_{out} + C_p)$ . (3.17) 

Thus $t_{q,inv}$ = 0.1 ns and $R_{inv}$ = 1.60 kΩ. The input capacitance of the 1X 
inverter (the standard load for this library) is specified in the data book as $C_{inv}$ = 
0.036 pF; thus $\tau$ = (0.036 pF)(1.60 kΩ) = 0.06 ns for the C5 technology. 

The use of logical effort consists of rearranging and understanding the meaning 
of the various terms in Eq. 3.15. The delay equation is the sum of three terms, 

$d = f + p + q$ . (3.18) 

We give these terms special names as follows: 

delay = effort delay + parasitic delay + nonideal delay . (3.19) 

The effort delay $f$ we write as a product of logical effort, $g$, and electrical effort, $h$: 

$f = gh$ . (3.20) 

So we can further partition delay into the following terms: 

delay = logical effort × electrical effort + parasitic delay + nonideal delay . (3.21) 

The logical effort $g$ is a function of the type of logic cell, 

$g = RC / \tau$ . (3.22) 

What size of logic cell do the $R$ and $C$ refer to? It does not matter because the $R$ 
and $C$ will change as we scale a logic cell, but the $RC$ product stays the same: the 
logical effort is independent of the size of a logic cell. We can find the logical 
effort by scaling down the logic cell so that it has the same drive capability as the 
1X minimum-size inverter. Then the logical effort, $g$, is the ratio of the input 
capacitance, $C_{in}$, of the 1X version of the logic cell to $C_{inv}$ (see Figure 3.8). 

FIGURE 3.8 Logical effort. (a) The input capacitance, $C_{inv}$, looking into the 
input of a minimum-size inverter in terms of the gate capacitance of a 
minimum-size device. (b) Sizing a logic cell to have the same drive strength as a 
minimum-size inverter (assuming a logic ratio of 2). The input capacitance 
looking into one of the logic-cell terminals is then $C_{in}$. (c) The logical effort of 
a cell is $C_{in} / C_{inv}$. For a two-input NAND cell, the logical effort, $g$ = 4/3. 



The electrical effort $h$ depends only on the load capacitance $C_{out}$ connected to 
the output of the logic cell and the input capacitance of the logic cell, $C_{in}$; thus 

$h = C_{out} / C_{in}$ . (3.23) 

The parasitic delay $p$ depends on the intrinsic parasitic capacitance $C_p$ of the 
logic cell, so that 

$p = RC_p / \tau$ . (3.24) 



Table 3.2 shows the logical efforts for single-stage logic cells. Suppose the 
minimum-size inverter has an n-channel transistor with W/L = 1 and a 
p-channel transistor with W/L = 2 (logic ratio, $r$, of 2). Then each two-input 
NAND logic cell input is connected to an n-channel transistor with W/L = 2 and 
a p-channel transistor with W/L = 2. The input capacitance of the two-input 
NAND logic cell divided by that of the inverter is thus 4/3. This is the logical 
effort of a two-input NAND when $r$ = 2. Logical effort depends on the ratio of the 
logic. For an $n$-input NAND cell with ratio $r$, the p-channel transistors are W/L 
= $r$/1, and the n-channel transistors are W/L = $n$/1. For a NOR cell the 
n-channel transistors are 1/1 and the p-channel transistors are $nr$/1. 
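
The transistor sizes just described lead directly to the cell-effort formulas listed in Table 3.2: add the n-channel and p-channel widths at one input and divide by the inverter input width, 1 + r. A minimal sketch follows, with hypothetical function names and exact fractions to avoid rounding.

```python
from fractions import Fraction


def nand_effort(n, r=2):
    """Logical effort of one input of an n-input NAND with logic ratio r."""
    r = Fraction(r)
    # n-channel width n/1 plus p-channel width r/1, over inverter input 1 + r
    return (n + r) / (r + 1)


def nor_effort(n, r=2):
    """Logical effort of one input of an n-input NOR with logic ratio r."""
    r = Fraction(r)
    # n-channel width 1/1 plus p-channel width nr/1, over inverter input 1 + r
    return (n * r + 1) / (r + 1)


for n in (1, 2, 3, 4):
    print(n, nand_effort(n), nor_effort(n), float(nor_effort(n, Fraction(3, 2))))
```

With n = 1 both formulas reduce to 1, the inverter, which is a useful sanity check on any entry added to the table.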



TABLE 3.2 Cell effort, parasitic delay, and nonideal delay (in units of τ) for 
single-stage CMOS cells. 

Cell            Cell effort          Cell effort          Parasitic delay/τ           Nonideal delay/τ
                (logic ratio = 2)    (logic ratio = r)
inverter        1 (by definition)    1 (by definition)    p_inv (by definition) [1]   q_inv (by definition) [1]
n-input NAND    (n + 2)/3            (n + r)/(r + 1)      n p_inv                     n q_inv
n-input NOR     (2n + 1)/3           (nr + 1)/(r + 1)     n p_inv                     n q_inv



The parasitic delay arises from parasitic capacitance at the output node of a 
single-stage logic cell, and most (but not all) of this is due to the source and drain 
capacitance. The parasitic delay of a minimum-size inverter is 

$p_{inv} = C_p / C_{inv}$ . (3.25) 

The parasitic delay is a constant for any technology. For our C5 technology we 
know $RC_p$ = 0.06 ns and, using Eq. 3.17 for a minimum-size inverter, we can 
calculate $p_{inv} = RC_p / \tau$ = 0.06/0.06 = 1 (this is purely a coincidence). Thus $C_p$ 
is about equal to $C_{inv}$ and is approximately 0.036 pF. There is a large error in 
calculating $p_{inv}$ from extracted delay values that are so small. Often we can 
calculate $p_{inv}$ more accurately by estimating the parasitic capacitance from 
layout. 

Because $RC_p$ is constant, the parasitic delay is equal to the ratio of the parasitic 
capacitance of a logic cell to the parasitic capacitance of a minimum-size 
inverter. In practice this ratio is very difficult to calculate; it depends on the 
layout. We can approximate the parasitic delay by assuming it is proportional to 
the sum of the widths of the n-channel and p-channel transistors connected to 
the output. Table 3.2 shows the parasitic delay for different cells in terms of $p_{inv}$. 



The nonideal delay $q$ is hard to predict and depends mainly on the physical size 
of the logic cell (proportional to the cell area in general, or width in the case of a 
standard cell or a gate-array macro), 

$q = st_q / \tau$ . (3.26) 

We define $q_{inv}$ in the same way we defined $p_{inv}$. An $n$-input cell is 
approximately $n$ times larger than an inverter, giving the values for nonideal 
delay shown in Table 3.2. For our C5 technology, from Eq. 3.17, $q_{inv} = t_{q,inv} / \tau$ 
= 0.1 ns/0.06 ns = 1.7. 

3.3.1 Predicting Delay 

As an example, let us predict the delay of a three-input NOR logic cell with 2X 
drive, driving a net with a fanout of four, with a total load capacitance 
(comprising the input capacitance of the four cells we are driving plus the 
interconnect) of 0.3 pF. 

From Table 3.2 we see $p = 3p_{inv}$ and $q = 3q_{inv}$ for this cell. We can calculate 
$C_{in}$ from the fact that the input gate capacitance of a 1X drive, three-input NOR 
logic cell is equal to $gC_{inv}$, and for a 2X logic cell, $C_{in} = 2gC_{inv}$. Thus, 

$gh = g \frac{C_{out}}{C_{in}} = \frac{g (0.3 \text{ pF})}{2 g C_{inv}} = \frac{0.3 \text{ pF}}{(2)(0.036 \text{ pF})}$ . (3.27) 

(Notice that $g$ cancels out in this equation; we shall discuss this in the next 
section.) 

The delay of the NOR logic cell, in units of $\tau$, is thus 

$d = gh + p + q = \frac{0.3 \times 10^{-12}}{(2)(0.036 \times 10^{-12})} + (3)(1) + (3)(1.7)$ 

$= 4.1666667 + 3 + 5.1$ 

$= 12.266667 \tau$ , (3.28) 

equivalent to an absolute delay, $t_{PD} \approx 12.3 \times 0.06$ ns = 0.74 ns. 



The delay for a 2X drive, three-input NOR logic cell in the C5 library is 

$t_{PD} = (0.03 + 0.72 C_{out} + 0.60)$ ns . (3.29) 

With $C_{out}$ = 0.3 pF, 

$t_{PD} = 0.03 + (0.72)(0.3) + 0.60 = 0.846$ ns , (3.30) 

compared to our prediction of 0.74 ns. Almost all of the error here comes from 
the inaccuracy in predicting the nonideal delay. Logical effort gives us a method 
to examine relative delays, not to calculate absolute delays accurately. More 
important is that logical effort gives us an insight into why logic has the delay it 
does. 
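
The prediction of Eq. 3.28 and the data-book value of Eq. 3.30 can be reproduced in a few lines. This is a sketch using the C5 constants quoted in the text; the function and variable names are hypothetical.

```python
TAU_NS = 0.06      # time constant of the C5 technology
C_INV_PF = 0.036   # standard load (input capacitance of a 1X inverter)
P_INV = 1.0        # parasitic delay of the inverter, in units of tau
Q_INV = 1.7        # nonideal delay of the inverter, in units of tau


def predicted_delay_tau(c_out_pf, n_inputs, drive):
    """Logical-effort delay of an n-input cell, in units of tau (Eq. 3.28)."""
    # The logical effort g cancels: gh = g * C_out / (drive * g * C_inv)
    gh = c_out_pf / (drive * C_INV_PF)
    return gh + n_inputs * P_INV + n_inputs * Q_INV


d = predicted_delay_tau(0.3, n_inputs=3, drive=2)
model_ns = d * TAU_NS
book_ns = 0.03 + 0.72 * 0.3 + 0.60     # Eq. 3.29 evaluated at 0.3 pF

print(f"d = {d:.2f} tau, model = {model_ns:.2f} ns, data book = {book_ns:.3f} ns")
```

Setting Q_INV to zero in the sketch shows how much of the predicted delay rests on the nonideal term, the least certain part of the model.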

3.3.2 Logical Area and Logical Efficiency 

Figure 3.9 shows a single-stage OR-AND-INVERT cell that has different logical 
efforts at each input. The logical effort for the OAI221 is the logical-effort vector 
g = (7/3, 7/3, 5/3). For example, the first element of this vector, 7/3, is the logical 
effort of inputs A and B in Figure 3.9 . 



FIGURE 3.9 An OAI221 logic 
cell with different logical 
efforts at each input. In this 
case g = (7/3, 7/3, 5/3). The 
logical effort for inputs A and 
B is 7/3, the logical effort for 
inputs C and D is also 7/3, and 
for input E the logical effort is 
5/3. The logical area is the sum 
of the transistor areas, 33 
logical squares. 





We can calculate the area of the transistors in a logic cell (ignoring the routing 
area, drain area, and source area) in units of a minimum-size n-channel transistor; 
we call these units logical squares. We call the transistor area the logical area. 
For example, the logical area of a 1X drive cell, OAI221X1, is calculated as 
follows: 

• n-channel transistor sizes: 3/1 + 4 × (3/1) 

• p-channel transistor sizes: 2/1 + 4 × (4/1) 

• total logical area = 2 + (4 × 4) + (5 × 3) = 33 logical squares 

Figure 3.10 shows a single-stage AOI221 cell, with $g$ = (8/3, 8/3, 7/3). The 
calculation of the logical area (for an AOI221X1) is as follows: 

• n-channel transistor sizes: 1/1 + 4 × (2/1) 

• p-channel transistor sizes: 6/1 + 4 × (6/1) 

• logical area = 1 + (4 × 2) + (5 × 6) = 39 logical squares 



FIGURE 3.10 An 
AND-OR-INVERT cell, 
an AOI221, with 
logical-effort vector, $g$ = 
(8/3, 8/3, 7/3). The 
logical area is 39 logical 
squares. 

These calculations show us that the single-stage OAI221, with an area of 33 
logical squares and logical effort of (7/3, 7/3, 5/3), is more logically efficient than 
the single-stage AOI221 logic cell with a larger area of 39 logical squares and 
larger logical effort of (8/3, 8/3, 7/3). 
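
The logical areas and logical-effort vectors of the two cells follow from the transistor widths listed above. The sketch below records, for each input, the n-channel and p-channel width connected to it (logic ratio 2). The pin names A to E for the AOI221 and the helper names are assumptions for illustration.

```python
from fractions import Fraction

INV_INPUT = 1 + 2   # inverter input: n-channel 1/1 plus p-channel 2/1

# (n-channel width, p-channel width) connected to each input, in units of W/L
OAI221 = {"A": (3, 4), "B": (3, 4), "C": (3, 4), "D": (3, 4), "E": (3, 2)}
AOI221 = {"A": (2, 6), "B": (2, 6), "C": (2, 6), "D": (2, 6), "E": (1, 6)}


def logical_area(cell):
    """Sum of all transistor widths, in logical squares."""
    return sum(wn + wp for wn, wp in cell.values())


def logical_effort(cell, pin):
    """Input capacitance at one pin divided by the inverter input capacitance."""
    wn, wp = cell[pin]
    return Fraction(wn + wp, INV_INPUT)


for name, cell in (("OAI221", OAI221), ("AOI221", AOI221)):
    efforts = [str(logical_effort(cell, pin)) for pin in ("A", "C", "E")]
    print(name, logical_area(cell), efforts)
```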

3.3.3 Logical Paths 

When we calculated the delay of the NOR logic cell in Section 3.3.1, the answer 
did not depend on the logical effort of the cell, $g$ (it cancelled out in Eqs. 3.27 
and 3.28). This is because $g$ is a measure of the input capacitance of a 1X drive 
logic cell. Since we were not driving the NOR logic cell with another logic cell, 
the input capacitance of the NOR logic cell had no effect on the delay. This is 
what we do in a data book: we measure logic-cell delay using an ideal input 
waveform that is the same no matter what the input capacitance of the cell. 
Instead let us calculate the delay of a logic cell when it is driven by a 
minimum-size inverter. To do this we need to extend the notion of logical effort. 

So far we have only considered a single-stage logic cell, but we can extend the 
idea of logical effort to a chain of logic cells or logical path. Consider the logic 
path when we use a minimum-size inverter ($g_0$ = 1, $p_0$ = 1, $q_0$ = 1.7) to drive 
one input of a 2X drive, three-input NOR logic cell with $g_1 = (nr + 1)/(r + 1)$, 
$p_1 = 3p_{inv}$ = 3, $q_1 = 3q_{inv}$ = 5.1, and a load equal to four standard loads. If the 
logic ratio is $r$ = 1.5, then $g_1$ = 5.5/2.5 = 2.2. 

The delay of the inverter is 

$d = g_0 h_0 + p_0 + q_0 = (1)(2g_1)(C_{inv}/C_{inv}) + 1 + 1.7$ (3.31) 

$= (1)(2)(2.2) + 1 + 1.7$ 

$= 7.1$ . 

Of this 7.1 τ delay we can attribute 4.4 τ to the loading of the NOR logic cell input 
capacitance, which is $2g_1 C_{inv}$. The delay of the NOR logic cell is, as before, 
$d = g_1 h_1 + p_1 + q_1$ = 12.3, making the total delay 7.1 + 12.3 = 19.4, so the 
absolute delay is (19.4)(0.06 ns) = 1.164 ns, or about 1.2 ns. 

We can see that the path delay $D$ is the sum of the effort delay, parasitic delay, 
and nonideal delay at each stage. In general, we can write the path delay as 

$D = \sum_{i \in path} g_i h_i + \sum_{i \in path} (p_i + q_i)$ . (3.32) 
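
Equation 3.32 is a plain sum over stages, so the inverter-plus-NOR example can be checked with a short sketch. It assumes the NOR cell drives the 0.3 pF load of Section 3.3.1 (the load behind the 12.3 τ figure) and uses $q_1 = 3q_{inv}$ = 5.1 from Table 3.2; the names are hypothetical.

```python
C_INV_PF = 0.036   # standard load of the C5 library
TAU_NS = 0.06


def path_delay(stages):
    """Eq. 3.32: sum of g*h + p + q over stages given as (g, h, p, q)."""
    return sum(g * h + p + q for (g, h, p, q) in stages)


r = 1.5
g_nor = (3 * r + 1) / (r + 1)            # three-input NOR: 2.2
c_in_nor_pf = 2 * g_nor * C_INV_PF       # 2X drive doubles the input capacitance

inverter = (1.0, c_in_nor_pf / C_INV_PF, 1.0, 1.7)   # drives the NOR input
nor3_2x = (g_nor, 0.3 / c_in_nor_pf, 3.0, 5.1)       # drives the 0.3 pF load

D = path_delay([inverter, nor3_2x])
print(f"D = {D:.1f} tau = {D * TAU_NS:.2f} ns")
```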



3.3.4 Multistage Cells 

Consider the following function (a multistage AOI221 logic cell): 

ZN(A1, A2, B1, B2, C) 

= NOT(NAND(NAND(A1, A2), AOI21(B1, B2, C))) 

= (((A1·A2)' · (B1·B2 + C)')')' 

= (A1·A2 + B1·B2 + C)' 

= AOI221(A1, A2, B1, B2, C) . (3.33) 



Figure 3.11(a) shows this implementation with each input driven by a 
minimum-size inverter so we can measure the effect of the cell input capacitance. 

FIGURE 3.11 Logical paths. (a) An AOI221 logic cell constructed as a 
multistage cell from smaller cells; the figure annotation gives the path delay 
$d_1 = (1 \times 1.4 + 1 + 1.7) + (1.4 \times 1 + 2 + 3.4) + (1.4 \times 0.7 + 2 + 3.4) + (1 \times C_L + 1 + 1.7) = 20 + C_L$. 
(b) A single-stage AOI221 logic cell. 



The logical efforts of each of the logic cells in Figure 3.11(a) are as follows: 

$g_0 = g_4 = g(\text{NOT}) = 1$ , 

$g_1 = g(\text{AOI21}) = (2, (2r + 1)/(r + 1)) = (2, 4/2.5) = (2, 1.6)$ , 

$g_2 = g_3 = g(\text{NAND2}) = (r + 2)/(r + 1) = (3.5)/(2.5) = 1.4$ . (3.34) 

Each of the logic cells in Figure 3.11 has a 1X drive strength. This means that 
the input capacitance of each logic cell is given, as shown in the figure, by $gC_{inv}$. 


Using Eq. 3.32 we can calculate the delay from the input of the inverter driving 
A1 to the output ZN as 

$d_1 = (1)(1.4) + 1 + 1.7 + (1.4)(1) + 2 + 3.4$ 

$+ (1.4)(0.7) + 2 + 3.4 + (1) C_L + 1 + 1.7$ 

$= (20 + C_L)$ . (3.35) 

In Eq. 3.35 we have normalized the output load, $C_L$, by dividing it by a 
standard load (equal to $C_{inv}$). We can calculate the delays of the other paths 
similarly. 

More interesting is to compare the multistage implementation with the 
single-stage version. In our C5 technology, with a logic ratio, $r$ = 1.5, we can 
calculate the logical effort for a single-stage AOI221 logic cell as 

$g(\text{AOI221}) = ((3r + 2)/(r + 1), (3r + 2)/(r + 1), (3r + 1)/(r + 1))$ 

$= (6.5/2.5, 6.5/2.5, 5.5/2.5)$ 

$= (2.6, 2.6, 2.2)$ . (3.36) 

This gives the delay from an inverter driving the A input to the output ZN of the 
single-stage logic cell as 

$d_1 = (1)(2.6) + 1 + 1.7 + (1) C_L + 5 + 8.5$ 

$= 18.8 + C_L$ . (3.37) 

The single-stage delay is very close to the delay for the multistage version of this 
logic cell. In some ASIC libraries the AOI221 is implemented as a multistage 
logic cell instead of using a single stage. It raises the question: Can we make the 
multistage logic cell any faster by adjusting the scale of the intermediate logic 
cells? 
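
To make the comparison of Eqs. 3.35 and 3.37 concrete, the sketch below evaluates both paths for a given normalized load. Each cell is described by its logical effort, its input capacitance in standard loads, and its parasitic and nonideal delays; the tuple layout and names are hypothetical, and the values are those used in the two equations.

```python
def path_delay(cells, c_load):
    """Delay in units of tau; cells are (g, c_in, p, q), capacitance in standard loads."""
    total = 0.0
    for i, (g, c_in, p, q) in enumerate(cells):
        # Each cell drives the input of the next cell, or the final load
        c_out = cells[i + 1][1] if i + 1 < len(cells) else c_load
        total += g * (c_out / c_in) + p + q
    return total


INV = (1.0, 1.0, 1.0, 1.7)
NAND2 = (1.4, 1.4, 2.0, 3.4)
AOI221_A = (2.6, 2.6, 5.0, 8.5)       # input A of the single-stage cell

multistage = [INV, NAND2, NAND2, INV]  # path A1 -> ZN in Figure 3.11(a)
single_stage = [INV, AOI221_A]         # Figure 3.11(b)

for c_load in (1, 4, 16):
    print(c_load, path_delay(multistage, c_load), path_delay(single_stage, c_load))
```

Because both versions end in a cell with 1X drive, the load term is the same in each, and the difference stays at about 1.2 τ regardless of the load.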

3.3.5 Optimum Delay 

Before we can attack the question of how to optimize delay in a logic path, we 
shall need some more definitions. The path logical effort $G$ is the product of 
logical efforts on a path: 

$G = \prod_{i \in path} g_i$ . (3.38) 

The path electrical effort $H$ is the product of the electrical efforts on the path, 

$H = \prod_{i \in path} h_i = \frac{C_{out}}{C_{in}}$ , (3.39) 

where $C_{out}$ is the last output capacitance on the path (the load) and $C_{in}$ is the 
first input capacitance on the path. 

The path effort $F$ is the product of the path electrical effort and the path logical effort, 

$F = GH$ . (3.40) 

The optimum effort delay for each stage is found by minimizing the path delay $D$ 
by varying the electrical efforts of each stage $h_i$, while keeping $H$, the path 
electrical effort, fixed. The optimum effort delay is achieved when each stage 
operates with equal effort, 

$\hat{f}_i = g_i h_i = F^{1/N}$ . (3.41) 

This is a useful result. The optimum path delay is then 

$\hat{D} = N F^{1/N} + P + Q = N (GH)^{1/N} + P + Q$ , (3.42) 

where $P + Q$ is the sum of path parasitic delay and nonideal delay, 

$P + Q = \sum_{i \in path} (p_i + q_i)$ . (3.43) 

We can use these results to improve the AOI221 multistage implementation of 
Figure 3.11(a). Assume that we need a 1X cell, so the output inverter (cell 4) 
must have 1X drive strength. This fixes the capacitance we must drive as $C_{out}$ = 
$C_{inv}$ (the capacitance at the input of this inverter). The input inverters are 
included to measure the effect of the cell input capacitance, so we cannot cheat 
by altering these. This fixes the input capacitance as $C_{in} = C_{inv}$. In this case $H$ = 1. 

The logic cells that we can scale on the path from the A input to the output are 
NAND logic cells labeled as 2 and 3. In this case 

$G = g_0 \times g_2 \times g_3 = 1 \times 1.4 \times 1.4 = 1.96$ . (3.44) 

Thus $F = GH$ = 1.96 and the optimum stage effort is $1.96^{1/3}$ = 1.25, so that the 
optimum effort delay is $N F^{1/N}$ = 3.75. From Figure 3.11(a) we see that 

$g_0 h_0 + g_2 h_2 + g_3 h_3 = 1.4 + 1.4 + 1 = 3.8$ . (3.45) 

This means that even if we scale the sizes of the cells to their optimum values, we 
only save a fraction of a τ (3.8 − 3.75 = 0.05). This is a useful result (and one that 
is true in general): the delay is not very sensitive to the scale of the cells. In this 
case it means that we can reduce the size of the two NAND cells in the multicell 
implementation of an AOI221 without sacrificing speed. We can use logical 
effort to predict what the change in delay will be for any given cell sizes. 

We can use logical effort in the design of logic cells and in the design of logic 
that uses logic cells. If we do have the flexibility to continuously size each logic 
cell (which in ASIC design we normally do not; we usually have to choose from 
1X, 2X, 4X drive strengths), each logic stage can be sized using the equation for 
the individual stage electrical efforts, 

$\hat{h}_i = \frac{F^{1/N}}{g_i}$ . (3.46) 



For example, even though we know that it will not improve the delay by much, 
let us size the cells in Figure 3.11(a). We shall work backward starting at the 
fixed load capacitance at the input of the last inverter. 

For NAND cell 3, $gh$ = 1.25; thus (since $g$ = 1.4), $h = C_{out}/C_{in}$ = 0.893. The 
output capacitance, $C_{out}$, for this NAND cell is the input capacitance of the 
inverter, fixed as 1 standard load, $C_{inv}$. This fixes the input capacitance, $C_{in}$, of 
NAND cell 3 at 1/0.893 = 1.12 standard loads. Thus, the scale of NAND cell 3 is 
1.12/1.4 or 0.8X. 

Now for NAND cell 2, $gh$ = 1.25; $C_{out}$ for NAND cell 2 is the $C_{in}$ of NAND 
cell 3. Thus $C_{in}$ for NAND cell 2 is 1.12/0.893 = 1.254 standard loads. This 
means the scale of NAND cell 2 is 1.254/1.4 or 0.9X. 
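
The backward sizing procedure just worked by hand can be written as a small routine. This is a sketch under the same assumptions as the text (a single path without branching, first input capacitance and final load both fixed); the function name and argument layout are hypothetical.

```python
from math import prod


def size_path(efforts, c_in_first, c_load):
    """Return the optimum stage effort and the input capacitances of stages 1..N-1.

    efforts    -- logical effort of each stage, first to last
    c_in_first -- fixed input capacitance of the first stage (standard loads)
    c_load     -- fixed load on the last stage (standard loads)
    """
    n = len(efforts)
    f_opt = (prod(efforts) * c_load / c_in_first) ** (1.0 / n)   # Eq. 3.41
    c_ins = []
    c_out = c_load
    # Work backward from the load; the first stage is not resized
    for g in reversed(efforts[1:]):
        c_in = g * c_out / f_opt          # from g * (c_out / c_in) = f_opt
        c_ins.insert(0, c_in)
        c_out = c_in
    return f_opt, c_ins


# Inverter 0, NAND cell 2, NAND cell 3, driving the 1X output inverter
f_opt, (c_nand2, c_nand3) = size_path([1.0, 1.4, 1.4], 1.0, 1.0)
print(f"stage effort {f_opt:.3f}")
print(f"NAND cell 2 scale {c_nand2 / 1.4:.2f}X, NAND cell 3 scale {c_nand3 / 1.4:.2f}X")
```

Dividing each input capacitance by the 1X input capacitance of the cell (1.4 standard loads for the NAND2) gives the scale factor, as in the hand calculation.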

The optimum sizes of the NAND cells are not very different from 1X in this case 
because $H$ = 1 and we are only driving a load no bigger than the input 
capacitance. This raises the question: What is the optimum stage effort if we have 
to drive a large load, $H \gg 1$? Notice that, so far, we have only calculated the 
optimum stage effort when we have a fixed number of stages, $N$. We have said 
nothing about the situation in which we are free to choose $N$, the number of 
stages. 

3.3.6 Optimum Number of Stages 

Suppose we have a chain of $N$ inverters each with equal stage effort, $f = gh$. 
Neglecting parasitic and nonideal delay, the total path delay is $Nf = Ngh = Nh$, 
since $g$ = 1 for an inverter. Suppose we need to drive a path electrical effort $H$; 
then $h^N = H$, or $N \ln h = \ln H$. Thus the delay is $Nh = h \ln H / \ln h$. Since $\ln H$ is 
fixed, we can only vary $h / \ln h$. Figure 3.12 shows that this is a very shallow 
function with a minimum at $h = e \approx 2.718$. At this point $\ln h$ = 1 and the total 
delay is $Ne = e \ln H$. This result is particularly useful in driving large loads 
either on-chip (the clock, for example) or off-chip (I/O pad drivers, for example). 
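
Because the number of stages must be an integer, the stage effort in a real chain only approximates e. The sketch below searches for the integer N that minimizes N·H^(1/N), neglecting parasitic and nonideal delay as in the text, and ignoring whether the resulting chain inverts. The function names and the example value H = 1000 are illustrative assumptions.

```python
import math


def chain_delay(H, n):
    """Delay (in units of tau) of n equal-effort inverters driving effort H."""
    return n * H ** (1.0 / n)


def best_stage_count(H):
    """Integer number of stages that minimizes chain_delay for H > 0."""
    n_max = max(1, math.ceil(math.log(H))) + 1
    return min(range(1, n_max + 1), key=lambda n: chain_delay(H, n))


H = 1000.0
n_best = best_stage_count(H)
print(f"N = {n_best}, stage effort = {H ** (1.0 / n_best):.2f}")
print(f"delay = {chain_delay(H, n_best):.2f}, bound e ln H = {math.e * math.log(H):.2f}")
```

Trying neighboring values of N in chain_delay shows how shallow the minimum is: one stage more or fewer changes the delay by only about one percent for this load.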



FIGURE 3.12 Stage effort. 

Figure 3.12 shows us how to minimize delay regardless of area or power and 
neglecting parasitic and nonideal delays. More complicated equations can be 
derived, including nonideal effects, when we wish to trade off delay for smaller 
area or reduced power. 



1. For the Compass 0.5 µm technology (C5): $p_{inv}$ = 1.0, $q_{inv}$ = 1.7, $R_{inv}$ = 1.5 
kΩ, $C_{inv}$ = 0.036 pF. 



3.4 Library-Cell Design 

The optimum cell layout for each process generation changes because the design 
rules for each ASIC vendor's process are always slightly different, even for the 
same generation of technology. For example, two companies may have very 
similar 0.35 µm CMOS process technologies, but the third-level metal spacing 
might be slightly different. If a cell library is to be used with both processes, we 
could construct the library by adopting the most stringent rules from each 
process. A library constructed in this fashion may not be competitive with one 
that is constructed specifically for each process. Even though ASIC vendors prize 
their design rules as secret, it turns out that they are similar, except for a few 
details. Unfortunately, it is the details that stop us moving designs from one 
process to another. Unless we are a very large customer it is difficult to have an 
ASIC vendor change or waive design rules for us. We would like all vendors to 
agree on a common set of design rules. This is, in fact, easier than it sounds. The 
reason that most vendors have similar rules is because most vendors use the same 
manufacturing equipment and a similar process. It is possible to construct a 
highest common denominator library that extracts the most from the current 
manufacturing capability. Some library companies and the large Japanese ASIC 
vendors are adopting this approach. 

Layout of library cells is either hand-crafted or uses some form of symbolic 
layout . Symbolic layout is usually performed in one of two ways: using either 
interactive graphics or a text layout language. Shapes are represented by simple 
lines or rectangles, known as sticks or logs , in symbolic layout. The actual 
dimensions of the sticks or logs are determined after layout is completed in a 
postprocessing step. An alternative to graphical symbolic layout uses a text 
layout language, similar to a programming language such as C, that directs a 
program to assemble layout. The spacing and dimensions of the layout shapes are 
defined in terms of variables rather than constants. These variables can be 
changed after symbolic layout is complete to adjust the layout spacing to a 
specific process. 

Mapping symbolic layout to a specific process technology uses 10 to 20 percent 
more area than hand-crafted layout (though this can then be further reduced to 5 
to 10 percent with compaction). Most symbolic layout systems do not allow 45° 
layout and this introduces a further area penalty (my experience shows this is 
about 5 to 15 percent). As libraries get larger, and the capability to quickly move 
libraries and ASIC designs between different generations of process technologies 
becomes more important, the advantages of symbolic layout may outweigh the 
disadvantages. 




3.5 Library Architecture 



Figure 3.13(a) shows cell use data from over 150 CMOS gate array designs. 
These results are remarkably similar to those from other ASIC designs using 
different libraries and different technologies and show that typically 80 percent of 
an ASIC uses less than 20 percent of the cell library. 

FIGURE 3.13 Cell library statistics. 

We can use the data in Figure 3.13(a) to derive some useful conclusions about the 
number and types of cells to be included in a library. Before we do this, a few 
words of caution are in order. First, the data shown in Figure 3.13(a) tells us about 
cells that are included in a library. This data cannot tell us anything about cells that 
are not (and perhaps should be) included in a library. Second, the type of design 
entry we use, and the type of ASIC we are designing, can dramatically affect the 
profile of the use of different cell types. For example, if we use a high-level design 
language, together with logic synthesis, to enter an ASIC design, this will favor the 
use of the complex combinational cells (cells of the AOI family that are 
particularly area efficient in CMOS, but are difficult to work with when we design 
by hand). 

Figure 3.13(a) tells us which cells we use most often, but does not take into 
account the cell area. What we really want to know is which cells are most 
important in determining the area of an ASIC. Figure 3.13(b) shows the area of 
the cells, normalized to the area of a minimum-size inverter. If we take the data in 
Figure 3.13(a) and multiply by the cell areas, we can derive a new measure of the 
contribution of each cell in a library (Figure 3.13c). This new measure, cell 
importance, is a measure of how much area each cell in a library contributes to a 
typical ASIC. For example, we can see from Figure 3.13(c) that a D flip-flop 
(with a cell importance of 3.5) contributes 3.5 times as much area on a typical 
ASIC as does an inverter (with a cell importance of 1). 

Figure 3.13(c) shows cell importance ordered by the cell frequency of use and 
normalized to an inverter. We can rearrange this data in terms of cell importance, 
as shown in Figure 3.13(d), and normalized so that now the most important cell, a 
D flip-flop, has a cell importance of 1. Figure 3.13(e) includes the cell use data on 
the same scale as the cell importance data. Both show roughly the same shape, 
reflecting that both measures obey an 80/20 rule. Roughly 20 percent of the cells in 
a library correspond to 80 percent of the ASIC area and 80 percent of the cells we 
use (but not the same 20 percent; that is why cell importance is useful). 

Figure 3.13(e) shows us that the most important cells, measured by their 
contribution to the area of an ASIC, are not necessarily the cells that we use most 
often. If we wish to build or buy a dense library, we must concentrate on the area 
of those cells that have the highest cell importance, not the most common cells. 
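
Cell importance is simply frequency of use multiplied by cell area, normalized to a reference cell. The sketch below shows the calculation on made-up numbers: the use fractions and areas are hypothetical, chosen only so that the D flip-flop comes out at the importance of 3.5 quoted above, and are not the data of Figure 3.13.

```python
# Hypothetical library data: name -> (fraction of cell instances, area in inverter units)
cells = {
    "inverter": (0.10, 1.0),
    "nand2": (0.08, 1.5),
    "dff": (0.05, 7.0),
}


def cell_importance(cells, reference="inverter"):
    """Use x area for each cell, normalized to the reference cell."""
    raw = {name: use * area for name, (use, area) in cells.items()}
    return {name: value / raw[reference] for name, value in raw.items()}


importance = cell_importance(cells)
for name, value in sorted(importance.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.2f}")
```

In this made-up data the inverter is the most frequently used cell, yet the D flip-flop has the highest importance, which is the distinction the text draws.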



3.6 Gate-Array Design 

Each logic cell or macro in a gate-array library is predesigned using fixed tiles of 
transistors known as the gate-array base cell (or just base cell). We call the 
arrangement of base cells across a whole chip in a complete gate array the 
gate-array base (or just base). ASIC vendors offer a selection of bases, with a 
different total number of transistors on each base. For example, if our ASIC 
design uses 48k equivalent gates and the ASIC vendor offers gate-array bases 
with 50k, 75k, and 100k gates, we will probably have to use the 75k-gate base 
(because it is unlikely that we can use 48/50 or 96 percent of the transistors on 
the 50k-gate base). 

We isolate the transistors on a gate array from one another either with thick field 
oxide (in the case of oxide-isolated gate arrays) or by using other transistors that 
are wired permanently off (in gate-isolated gate arrays). Channeled and 
channelless gate arrays may use either gate isolation or oxide isolation. 

Figure 3.14(a) shows a base cell for a gate-isolated gate array. This base cell 
has two transistors: one p-channel and one n-channel. When these base cells are 
placed next to each other, the n-diffusion and p-diffusion layers form continuous 
strips that run across the entire chip, broken only at the poly gates that cross at 
regularly spaced intervals (Figure 3.14b). The metal interconnect spacing 
determines the separation of the transistors. The metal spacing is determined by 
the design rules for the metal and contacts. In Figure 3.14(c) we have shown all 
possible locations for a contact in the base cell. There is room for 21 contacts in 
this cell and thus room for 21 interconnect lines running in a horizontal direction 
(we use m1 running horizontally). We say that there are 21 horizontal tracks in 
this cell or that the cell is 21 tracks high. In a similar fashion the space that we 
need for a vertical interconnect (m2) is called a vertical track. The horizontal and 
vertical track widths are not necessarily equal, because the design rules for m1 
and m2 are not always equal. 

We isolate logic cells from each other in gate-isolated gate arrays by connecting 
transistor gates to the supply bus; hence the name, gate isolation. If we connect 
the gate of an n-channel transistor to $V_{SS}$, we isolate the regions of n-diffusion 
on each side of that transistor (we call this an isolator transistor or device, or just 
isolator). Similarly if we connect the gate of a p-channel transistor to $V_{DD}$, we 
isolate adjacent p-diffusion regions. 

FIGURE 3.14 The construction of a gate-isolated gate array. (a) The 
one-track-wide base cell containing one p-channel and one n-channel transistor. 
(b) Three base cells: the center base cell is being used to isolate the base cells on 
either side from each other. (c) A base cell including all possible contact 
positions (there is room for 21 contacts in the vertical direction, showing the 
base cell has a height of 21 tracks). 



Oxide-isolated gate arrays often contain four transistors in the base cell: the two n 
-channel transistors share an n -diffusion strip and the two p -channel transistors 
share a p -diffusion strip. This means that the two n -channel transistors in each 
base cell are electrically connected in series, as are the p -channel transistors. The 
base cells are isolated from each other using oxide isolation . During the 
fabrication process a layer of the thick field oxide is left in place between each 
base cell and this separates the p -diffusion and n -diffusion regions of adjacent 
base cells. 

Figure 3.15 shows an oxide-isolated gate array . This cell contains eight 
transistors (which occupy six vertical tracks) plus one-half of a single track that 
contains the well contacts and substrate connections that we can consider to be 
shared by each base cell. 




FIGURE 3.15 An oxide-isolated gate-array base cell. The figure shows two base 
cells, each containing eight transistors and two well contacts. The p -channel and 
n -channel transistors are each 4 tracks high (corresponding to the width of the 
transistor). The leftmost vertical track of the left base cell includes all 12 
possible contact positions (the height of the cell is 12 tracks). As outlined here, 
the base cell is 7 tracks wide (we could also consider the base cell to be half this 
width). 

Figure 3.16 shows a base cell in which the gates of the n-channel and p-channel 
transistors are connected on the polysilicon layer. Connecting the gates in poly 
saves contacts and a metal interconnect in the center of the cell where 
interconnect is most congested. The drawback of the preconnected gates is a loss 
in flexibility in cell design. Implementing memory and logic based on 
transmission gates will be less efficient using this type of base cell, for example. 

FIGURE 3.16 This oxide-isolated gate-array base cell is 14 tracks high and 4 
tracks wide. VDD (tracks 3 and 4) and GND (tracks 11 and 12) are each 2 tracks 
wide. The metal lines to the left of the cell indicate the 10 horizontal routing 
tracks (tracks 1, 2, 5 to 10, 13, 14). Notice that the p-channel and n-channel 
polysilicon gates are tied together in the center of the cell. The well contacts are 
short, leaving room for a poly cross-under in each base cell. 



Figure 3.17 shows the metal personalization for a D flip-flop macro in a 
gate-isolated gate array using a base cell similar to that shown in Figure 3.14(a). 
This macro uses 20 base cells, for a total of 40 transistors, equivalent to 10 gates. 

FIGURE 3.17 An example of a flip-flop macro in a gate-isolated gate-array 
library. Only the first-level metallization and contact pattern (the 
personalization) is shown on the right, but this is enough information to derive 
the schematic. The base cell is shown on the left. This macro is 20 tracks wide. 



The gates of the base cells shown in Figures 3.14–3.16 are bent. The bent gate 
allows contacts to the gates to be placed on the same grid as the contacts to 
diffusion. The polysilicon gates run in the space between adjacent metal 
interconnect lines. This saves space and also simplifies the routing software. 

There are many trade-offs that determine the gate-array base cell height. One 
factor is the number of wires that can be run horizontally through the base cell. 
This will determine the capacity of the routing channel formed from an unused 
row of base cells. The base cell height also determines how easy it is to wire the 
logic macros since it determines how much space for wiring is available inside 
the macros. 

There are other factors that determine the width of the base-cell transistors. The 
widths of the p-channel and n-channel transistors are slightly different in Figure 
3.14(a). The p-channel transistors are 6 tracks wide and the n-channel 
transistors are 5 tracks wide. The ratio for this gate-array library is thus 
approximately 1.2. Most gate-array libraries are approaching a ratio of 1. 

ASIC designers are using ever-increasing amounts of RAM on gate arrays. It is 
inefficient to use the normal base cell for a static RAM cell and the size of RAM 
on an embedded gate array is fixed. As an alternative we can change the design 
of the base cell. A base cell designed for use as RAM has extra transistors (either 
four, two n-channel and two p-channel, or two n-channel; usually minimum 
width) allowing a six-transistor RAM cell to be built using one base cell instead 
of the two or three that we would normally need. This is one of the advantages of 
the CBA (cell-based array) base cell shown in Figure 3.18. 

FIGURE 3.18 The SiARC/Synopsys cell-based array (CBA) basic cell. 



3.7 Standard-Cell Design 

Figure 3.19 shows the components of the standard cell from Figure 1.3. Each 
standard cell in a library is rectangular with the same height but different widths. The 
bounding box (BB) of a logic cell is the smallest rectangle that encloses all of the 
geometry of the cell. The cell BB is normally determined by the well layers. Cell 
connectors or terminals (the logical connectors) must be placed on the cell abutment 
box (AB). The physical connector (the piece of metal to which we connect wires) 
must normally overlap the abutment box slightly, usually by at least 1 λ, to assure 
connection without leaving a tiny space between the ends of two wires. The standard 
cells are constructed so they can all be placed next to each other horizontally with the 
cell ABs touching (we abut two cells). 

FIGURE 3.19 (a) The standard cell shown in Figure 1.3. (b) Diffusion, poly, and 
contact layers. (c) m1 and contact layers. (d) The equivalent schematic. 

A standard cell (a D flip-flop with clear) is shown in Figure 3.20 and illustrates the 
following features of standard-cell layout: 

• Layout using 45° angles. This can save 10%–20% in area compared to a cell 
that uses only Manhattan or 90° geometry. Some ASIC vendors do not allow 
transistors with 45° angles; others do not allow 45° angles at all. 

• Connectors are at the top and bottom of the cell on m2 on a routing grid equal 
to the vertical (m2) track spacing. This is a double-entry cell intended for a 
two-level metal process. A standard cell designed for a three-level metal 
process has connectors in the center of the cell. 

• Transistor sizes vary to optimize the area and performance but maintain a fixed 
ratio to balance rise times and fall times. 

• The cell height is 64 λ (all cells in the library are the same height) with a 
horizontal (m1) track spacing of 8 λ. This is close to the minimum height that 
can accommodate the most complex cells in a library. 

• The power rails are placed at the top and bottom, maintaining a certain width 
inside the cell and abut with the power rails in adjacent cells. 

• The well contacts (substrate connections) are placed inside the cell at regular 
intervals. Additional well contacts may be placed in spacers between cells. 

• In this case both wells are drawn. Some libraries minimize the well or moat 
area to reduce leakage and parasitic capacitance. 

• Most commercial standard cells use m1 for the power rails, m1 for internal 
connections, and avoid using m2 where possible except for cell connectors. 




FIGURE 3.20 A D flip-flop standard cell. The wide power buses and 
transistors show this is a performance-optimized cell. This double-entry cell is 
intended for a two-level metal process and channel routing. The five 
connectors run vertically through the cell on m2 (the extra short vertical metal 
line is an internal crossover). 

When a library developer creates a gate-array, standard-cell, or datapath library, there 
is a trade-off between using wide, high-drive transistors that result in large cells with 
high-speed performance and using smaller transistors that result in smaller cells that 
consume less power. A performance-optimized library with large cells might be used 
for ASICs in a high-performance workstation, for example. An area-optimized library 
might be used in an ASIC for a battery-powered portable computer. 




3.8 Datapath-Cell Design 



Figure 3.21 shows a datapath flip-flop. The primary, thicker, power buses run 
vertically on m2 with thinner, internal power running horizontally on m1. The 
control signals (clock in this case) run vertically through the cell on m2. The 
control signals that are common to the cells above and below are connected 
directly in m2. The other signals (data, q, and qbar in this example) are brought 
out to the wiring channel between the rows of datapath cells. 




FIGURE 3.21 A datapath D flip-flop cell. 

Figure 3.22 is the schematic for Figure 3.21. This flip-flop uses a pair of 
cross-coupled inverters for storage in both the master and slave latches. This 
leads to a smaller and potentially faster layout than the flip-flop circuits that we 
use in gate-array and standard-cell ASIC libraries. The device sizes of the 
inverters in the datapath flip-flops are adjusted so that the state of the latches 
may be changed. Normally using this type of circuit is dangerous in an 
uncontrolled environment. However, because the datapath structure is regular and 
known, the parasitic capacitances that affect the operation of the logic cell are 
also known. This is another advantage of the datapath structure. 







FIGURE 3.22 The schematic of the datapath D flip-flop cell shown in Figure 
3.21 . 



Figure 3.23 shows an example of a datapath. Figure 3.23 (a) depicts a two-level 
metal version showing the space between rows or slices of the datapath. In this 
case there are many connections to be brought out to the right of the datapath, 
and this causes the routing channel to be larger than normal and thus easily seen. 
Figure 3.23 (b) shows a three-level metal version of the same datapath. In this 
case more of the routing is completed over the top of the datapath slices, reducing 
the size of the routing channel. 

FIGURE 3.23 A datapath. (a) Implemented in a two-level metal process. 
(b) Implemented in a three-level metal process. 









3.9 Summary 

In this chapter we covered ASIC libraries: cell design, layout, and 
characterization. The most important concepts that we covered in this chapter 
were 

• Tau, logical effort, and the prediction of delay 

• Sizes of cells, and their drive strengths 

• Cell importance 

• The difference between gate-array macros, standard cells, and datapath 
cells 




3.10 Problems 



* = difficult, ** = very difficult, *** = extremely difficult 

3.1 (Pull resistance, 10 min.) 

• a. Show that, for small $V_{DS}$, an n-channel transistor looks like a resistor, 
$R = 1/(\beta_n (V_{DD} - V_{tn}))$. 

• b. If $V_{GS} = V_{DD}$, $V_{DS} = 0$, and $k'_n = 200\ \mu\mathrm{A\,V^{-2}}$ (equal to the n-channel 
transistor SPICE parameter KP in Table 2.1), find the pull resistance, $R$, for a 
6/0.6 transistor in the linear region. 

Answer: (b) 213 Ω. 
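
The arithmetic for part (b) can be checked with a short Python sketch. The problem does not state the supply or threshold voltage here; the values V_DD = 3.0 V and V_tn = 0.65 V used below are assumptions chosen for illustration (they happen to reproduce the quoted answer), and the function name is hypothetical.

```python
def pull_resistance(kp, w_over_l, v_dd, v_tn):
    """Small-V_DS resistance of an n-channel transistor: R = 1 / (beta * (V_DD - V_tn))."""
    beta = kp * w_over_l                 # transistor gain factor in A/V^2
    return 1.0 / (beta * (v_dd - v_tn))


# k'n = 200 uA/V^2 and W/L = 6/0.6 come from the problem statement.
# V_DD = 3.0 V and V_tn = 0.65 V are ASSUMED values, not given in the problem.
r = pull_resistance(kp=200e-6, w_over_l=6 / 0.6, v_dd=3.0, v_tn=0.65)
print(f"R = {r:.0f} ohms")
```

With a different supply or threshold voltage the resistance changes in inverse proportion to the gate overdrive, V_DD - V_tn.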

3.2 (Inversion layer depth, 15 min.) In the absence of surface charge, Gauss's law 
demands continuity of the electric displacement vector, $D = \epsilon E$, at the silicon surface, 
so that $\epsilon_{ox} E_{ox} = \epsilon_{Si} E_{Si}$, where $\epsilon_{ox} = 3.9$, $\epsilon_{Si} = 11.7$. 

• a. Assuming the potential at the surface is $V_{GS} - V_t = 2.5$ V, calculate $E_{ox}$ and 
$E_{Si}$ if $T_{ox} = 100$ Å. 

• b. Assume that carrier density $\propto \exp(q\phi/kT)$, where $\phi$ is the potential; calculate 
the distance below the surface at which the inversion charge density falls to 10 
percent of its value at the surface. 

• c. Comment on the accuracy of your answers. 

Answer: (a) $2.5 \times 10^{8}\ \mathrm{V\,m^{-1}}$, $0.833 \times 10^{8}\ \mathrm{V\,m^{-1}}$. (b) 7.16 Å. 
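
A minimal Python sketch of parts (a) and (b) follows. It assumes a thermal voltage kT/q of 25.9 mV (room temperature) and a constant field E_Si just below the surface; both are simplifying assumptions, and the function names are ours.

```python
import math

EPS_OX = 3.9      # relative permittivity of the oxide (from the problem)
EPS_SI = 11.7     # relative permittivity of silicon (from the problem)
KT_Q = 0.0259     # thermal voltage in volts; room temperature is an assumption


def surface_fields(v_surface, t_ox):
    """Return (E_ox, E_Si) in V/m for a given surface potential and oxide thickness."""
    e_ox = v_surface / t_ox
    e_si = e_ox * EPS_OX / EPS_SI        # continuity of D = eps * E at the surface
    return e_ox, e_si


def depth_for_fraction(e_si, fraction):
    """Depth at which carrier density falls to `fraction` of its surface value,
    assuming the field stays constant at E_Si over that distance."""
    delta_phi = KT_Q * math.log(1.0 / fraction)   # potential drop needed
    return delta_phi / e_si


e_ox, e_si = surface_fields(v_surface=2.5, t_ox=100e-10)
depth = depth_for_fraction(e_si, 0.10)
print(f"E_ox = {e_ox:.3g} V/m, E_Si = {e_si:.3g} V/m, depth = {depth * 1e10:.2f} angstroms")
```

The constant-field assumption is one of the points that part (c) invites us to question.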

3.3 (Depletion layer depth, 15 min.) The depth of the depletion region under the gate is 
given by $x_d = \sqrt{2\epsilon_{Si}\phi_s/(qN_A)}$, where $\phi_s = 2V_T \ln(N_A/n_i)$ is the surface 
potential at strong inversion. Calculate $\phi_s$ and $x_d$ assuming: $\epsilon_{Si} = 1.0359 \times 10^{-10}\ \mathrm{F\,m^{-1}}$, 
the substrate doping, $N_A = 1.4 \times 10^{17}\ \mathrm{cm^{-3}}$, the intrinsic carrier concentration 
$n_i = 1.45 \times 10^{10}\ \mathrm{cm^{-3}}$ (at room temperature), and the thermal voltage $V_T = kT/q = 25.9$ 
mV. 

Answer: 0.833 V, 900 Å. 
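
The two formulas can be evaluated directly, as in the sketch below. The electron charge value and the function names are our additions; the only subtlety is converting the doping from cm^-3 to m^-3 before using SI units.

```python
import math

Q = 1.602e-19          # electron charge in coulombs (standard value, not from the problem)
EPS_SI = 1.0359e-10    # permittivity of silicon in F/m (from the problem)
V_T = 25.9e-3          # thermal voltage in volts (from the problem)


def surface_potential(n_a, n_i):
    """phi_s = 2 * V_T * ln(N_A / n_i); only the ratio matters, so units cancel."""
    return 2.0 * V_T * math.log(n_a / n_i)


def depletion_depth(phi_s, n_a_per_m3):
    """x_d = sqrt(2 * eps_Si * phi_s / (q * N_A)), with N_A in m^-3."""
    return math.sqrt(2.0 * EPS_SI * phi_s / (Q * n_a_per_m3))


n_a_cm3 = 1.4e17
n_i_cm3 = 1.45e10
phi_s = surface_potential(n_a_cm3, n_i_cm3)
x_d = depletion_depth(phi_s, n_a_cm3 * 1e6)    # 1 cm^-3 = 1e6 m^-3
print(f"phi_s = {phi_s:.3f} V, x_d = {x_d * 1e10:.0f} angstroms")
```

The computed depth is a little under 900 Å; the quoted answer appears to be rounded to one significant figure.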

3.4 (Logical effort, 45 min.) Calculate the logical effort at each input of an AOI122 
cell. Find an expression that allows you to calculate the logical effort for each input of 
an AOI nnnn cell for n = 1, 2, 3. 

3.5 (Gate-array macro design, 120 min.) Draw a 1X drive, two-input NAND cell using 
the gate-array base cells shown in Figures 3.14(a)–3.16 (lay a piece of thin paper over 
the figures and draw the contacts and metal personalization only). Label the inputs and 
outputs. Lay out a 1X drive, four-input NAND cell using the same base array cells. 
Now lay out a 2X drive, four-input NAND cell (think about this one). Make sure that 
you size your transistors properly to balance rise times and fall times. 

3.6 (Flip-flop library, 20 min.) Suppose we wish to build a library of flip-flops. We 
want to have flops with: positive-edge and negative-edge triggering; clear, preset 
(either, both, or neither); synchronous or asynchronous reset and preset controls if 
present (but not mixed on the same flip-flop); all flip-flops with or without scan as an 
option; flip-flops with Q and Qbar (either or both). How many flip-flops is that? 

(***) How would you attempt to prioritize which flip-flops to include in a library? 
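
One way to approach the counting question is to enumerate the options explicitly. The sketch below encodes one possible reading of the problem: two triggering edges; four control combinations; a synchronous or asynchronous choice only when a control is present; scan or no scan; and three output options (Q, Qbar, or both). These choices and all of the names are our assumptions, and a different reading of the problem gives a different count.

```python
from itertools import product

EDGES = ["posedge", "negedge"]
CONTROLS = ["none", "clear", "preset", "clear+preset"]
TIMING = ["sync", "async"]
SCAN = [False, True]
OUTPUTS = ["Q", "QN", "Q+QN"]


def enumerate_flip_flops():
    """List every flip-flop variant under the assumptions stated above."""
    variants = []
    for edge, ctrl, scan, out in product(EDGES, CONTROLS, SCAN, OUTPUTS):
        # The sync/async choice only exists when a clear or preset is present,
        # and it applies to both controls at once (no mixing on one flip-flop).
        modes = [None] if ctrl == "none" else TIMING
        for mode in modes:
            variants.append((edge, ctrl, mode, scan, out))
    return variants


flops = enumerate_flip_flops()
print(len(flops), "flip-flop variants under this interpretation")
```

Under this reading the count is 2 × (1 + 3 × 2) × 2 × 3 = 84, which makes clear why a library developer has to prioritize.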

3.7 (AOI and OAI cell ratios, 30 min.) In Figure 2.13(c) we adjusted the sizes of the 
transistors assuming that there was only one path through the n-channel and 
p-channel stacks. Suppose that p-channel transistors A, B, C, and D are all on and 
p-channel transistor E turns on. What is the equivalent resistance of the p-channel stack 
in this case? 



3.8 (**Eight-input AND, 60 min.) This question is an example in the paper by 
Sutherland and Sproull [1991] on logical effort. Figure 3.24 shows three different 
ways to design an eight-input AND cell, using NAND and NOR cells. 

• a. Find the logical effort at each input for A, B, C. Assume a logic ratio of 2. 

• b. Find the parasitic delay for A, B, C. Assume the parasitic delay of an inverter 
is 0.6. 

• c. Show that the path delays are given by the following equations where $H$ is the 
path electrical effort, if we ignore the nonideal delays: 

• (i) $2(3.33H)^{0.5} + 5.4$ (alternative A) 

• (ii) $2(3.33H)^{0.5} + 3.6$ (alternative B) 

• (iii) $4(2.96H)^{0.25} + 4.2$ (alternative C) 

• d. Use these equations to determine the best alternative for H = 2 and H = 32. 
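
The three delay expressions in part (c) are easy to tabulate. The following sketch simply evaluates them for the two values of H in part (d); the function names are ours.

```python
def delay_a(h):
    """Alternative A: two stages, path logical effort 3.33, parasitic delay 5.4."""
    return 2 * (3.33 * h) ** 0.5 + 5.4


def delay_b(h):
    """Alternative B: two stages, path logical effort 3.33, parasitic delay 3.6."""
    return 2 * (3.33 * h) ** 0.5 + 3.6


def delay_c(h):
    """Alternative C: four stages, path logical effort 2.96, parasitic delay 4.2."""
    return 4 * (2.96 * h) ** 0.25 + 4.2


def best_alternative(h):
    """Return the name of the fastest alternative and all three delays."""
    delays = {"A": delay_a(h), "B": delay_b(h), "C": delay_c(h)}
    return min(delays, key=delays.get), delays


for h in (2, 32):
    best, delays = best_alternative(h)
    summary = ", ".join(f"{name} = {d:.2f}" for name, d in delays.items())
    print(f"H = {h}: {summary}; best is {best}")
```

The comparison illustrates the general trend: a path with fewer stages wins when the electrical effort is small, and a path with more stages wins when it is large.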



FIGURE 3.24 An eight-input AND cell (Problem 3.8). 

3.9 (Special logic cells, 30 min.) Many ASIC cell libraries contain special logic cells. 
For example the Compass libraries contain a two-input NAND cell with an inverted 
input, FN01 = (A + B'). This saves routing area, is faster than using two separate cells, 
and is useful because the combination of a two-input NAND gate with one inverted 
input is heavily used by synthesis tools. Other special cells include: 

• FN02 = MAJ3 = (AB + AC + BC)' 

• FN03 = AOI2-2 = ((A'·B') + (C·D))' = (A + B)(C' + D') = OA2-2 

• FN04 = OAI2-2 

• FN05 = AB' = (A' + B)' 

• a. Draw schematics for these cells. 



• b. Calculate the logical effort and logical area for each cell. 

• c. Can you explain where and why these cells might be useful? 
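
Before drawing schematics it can help to confirm the Boolean identities quoted for these cells. The sketch below checks them by exhaustive truth table. We assume that the inverted input of FN01 is A (so that NAND(A', B) = A + B'); the function names are ours.

```python
from itertools import product


def fn01(a, b):
    """Two-input NAND with input A inverted (which input is inverted is an assumption)."""
    return not ((not a) and b)


def fn03_as_aoi(a, b, c, d):
    """AOI form: ((A'.B') + (C.D))'."""
    return not (((not a) and (not b)) or (c and d))


def fn03_as_oa(a, b, c, d):
    """OA form: (A + B)(C' + D')."""
    return (a or b) and ((not c) or (not d))


def fn05(a, b):
    """AB' written as a NOR with one inverted input: (A' + B)'."""
    return not ((not a) or b)


BOOLS = [False, True]
fn01_ok = all(fn01(a, b) == (a or (not b)) for a, b in product(BOOLS, repeat=2))
fn03_ok = all(fn03_as_aoi(*v) == fn03_as_oa(*v) for v in product(BOOLS, repeat=4))
fn05_ok = all(fn05(a, b) == (a and (not b)) for a, b in product(BOOLS, repeat=2))
print(fn01_ok, fn03_ok, fn05_ok)
```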

3.10 (Euler paths, 60 min.) There are several ways to arrange the stacks in the AOI211 
cell shown in Figure 3.25. For example, the n-channel transistor A can be below B 
without altering the function. Which arrangement would you predict gives a faster 
delay from A to Z and why? The p-channel transistors A and B can be above or below 
transistors C and D. How many distinct ways of arranging the transistors are there for 
this cell? What effect do the different arrangements have on layout? What effects do 
these different arrangements have on the cell performance? 

FIGURE 3.25 There are several ways to arrange the transistors in this AOI211 cell 
(Problem 3.10). 

3.11 (*AOI and OAI cell efficiency, 60 min.) A standard-cell library data book 
contains the following data: 

• AOI221: $t_R$ = 1.06–1.15 ns; $t_F$ = 1.09–1.55 ns; $C_{in}$ = 0.21–0.28 pF; $W_c$ = 28.8 μm 

• OAI221: $t_R$ = 0.77–1.05 ns; $t_F$ = 0.81–0.96 ns; $C_{in}$ = 0.25–0.39 pF; $W_c$ = 22.4 μm 

($W_c$ is the cell width; the cell height is 25.6 μm.) Calculate the (a) logical effort and 
(b) logical area for the AOI221 and OAI221 cells. 

The implementation of the OAI221 in this library uses a single stage, 

OAI221 = OAI221(a1, a2, b1, b2, c), 

whereas the AOI221 uses the following multistage implementation: 

AOI221 = NOT(NAND(NAND(a1, a2), AOI21(b1, b2, c))). 

(c) What are the alternative implementations for these two cells? (d) From your 
answers attempt to explain the implementations chosen. 

3.12 (**Logical efficiency, 60 min.) Extending Problem 3.11, let us compare an 
AOI33 with an OAI33 cell. (a) Calculate the logical effort and (b) logical areas for 
these cells. 

The AOI33 uses a single-stage implementation as follows: 

AOI33 = aoi33(a1, a2, a3, b1, b2, b3). 

The OAI33 uses the following multistage implementation: 

OAI33 = not[nor[nor(a1, a2, a3), nor(b1, b2, b3)]]. 

(c) Calculate the path delay, $D$, as a function of path electrical effort, $H$, for both of 
these implementations ignoring parasitic and nonideal delays. (d) Use Eq. 3.42 to 
calculate the optimum path delay for these cells. (e) Compare and explain the 
differences between your answers to parts c and d for $H$ = 1, 2, 4, and 8. 

The timing data from the data book is as follows (the cell height is 25.6 μm): 

• AOI33: $t_R$ = 0.70–1.06 ns; $t_F$ = 0.72–1.15 ns; $C_{in}$ = 0.21–0.28 pF; $W_c$ = 35.2 μm 

• OAI33: $t_R$ = 1.06–1.70 ns; $t_F$ = 1.42–1.98 ns; $C_{in}$ = 0.31–0.36 pF; $W_c$ = 48 μm 

(f) How does this data compare with your theoretical analysis? 

3.13 (EXOR cells and logical effort, 60 min.) Show how to implement a two-input 
EXOR cell using an AOI22 and two inverters. Using logical effort, compare this with 
an implementation using an AOI21 cell and a NOR cell. 

3.14 (***XNOR cells, 60 min.) Table 3.3 shows the implementation of XNOR cells 
in a standard-cell library. Analyze this data using the concept of logical effort. 

TABLE 3.3 Implementations of XNOR cells in CMOS (Problem 3.14). 

Cell                   Implementation 
Library 1: XNOR2D1     nand[or(a1, a2), nand(a1, a2)] 
Library 2: XNOR2D1     NOT[NOT[MUX(a1, NOT(a1), a2)]] 
Library 1: XNOR2D2     NOT[NOT[MUX(a1, NOT(a1), a2)]] 
Library 2: XNOR2D2     nand[or(a1, a2), nand(a1, a2)] 
Library 1: XNOR3D1     NOT[NOT[MUX(a1, NOT(a1), NOT(MUX(a3, NOT(a3), a2)))]] 
Library 1: XNOR3D2     NOT[NOT[MUX(a1, NOT(a1), NOT(MUX(a3, NOT(a3), a2)))]] 

3.15 (***Extensions to logical effort, 60 min.) The path branching effort $B$ is the 
product of branching efforts: 

$$B = \prod_{i \in \mathrm{path}} b_i . \qquad (3.47)$$

The branching effort is the ratio of the on-path plus off-path capacitance to the on-path 
capacitance. The path effort $F$ becomes the product of the path electrical effort, path 
branching effort, and path logical effort: 

$$F = GBH . \qquad (3.48)$$

Show that the path delay $D$ is 

$$D = \sum_{i \in \mathrm{path}} g_i b_i h_i + \sum_{i \in \mathrm{path}} p_i . \qquad (3.49)$$

(***) Show that the optimum path delay is then 

$$\hat{D} = N F^{1/N} + P = N (GBH)^{1/N} + P . \qquad (3.50)$$
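
To make the definitions concrete, the sketch below computes the path effort F = GBH and the optimum path delay N F^(1/N) + P for a list of stages. The stage values in the example are hypothetical numbers chosen only so that the arithmetic is easy to follow, and the function name is ours.

```python
import math


def optimum_path_delay(g, b, p, h_path):
    """Optimum delay N * F**(1/N) + P for a path.

    g, b, p are per-stage lists of logical effort, branching effort, and
    parasitic delay; h_path is the path electrical effort H.
    """
    n = len(g)
    path_logical_effort = math.prod(g)       # G
    path_branching_effort = math.prod(b)     # B
    f = path_logical_effort * path_branching_effort * h_path   # F = GBH
    return n * f ** (1.0 / n) + sum(p)


# Hypothetical two-stage path: an inverter driving a two-input NAND,
# with a branching effort of 2 at the first stage and H = 6.
d_hat = optimum_path_delay(g=[1.0, 4 / 3], b=[2.0, 1.0], p=[1.0, 2.0], h_path=6.0)
print(f"optimum path delay = {d_hat:.2f} (in units of tau)")
```

In this example F = (4/3)(2)(6) = 16, so each of the two stages carries a stage effort of 4 and the delay is 2 × 4 + 3 = 11.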

3.16 (*Circuits from layout, 120 min.) Figure 3.26 shows a D flip-flop with clear 
from a 1.0 μm standard-cell library. Figure 3.27 shows two layout views of this D 
flip-flop. Construct the circuit diagram for this flip-flop, labeling the nodes and 
transistors as shown. Include the transistor sizes (use estimates for transistors with 45° 
gates); you only need W/L values, and you can assume the gate lengths are all L = 2 λ, equal 
to the minimum feature size. Label the inputs and outputs to the cell and identify their 
functions. 

FIGURE 3.26 A D flip-flop from a 1.0 μm standard-cell library (Problem 3.16). 

3.17 (Flip-flop circuits, 30 min.) Draw the circuit schematic for a positive-edge-triggered 
D flip-flop with active-high set and reset (base your schematic on 
Figure 2.18a, a negative-edge-triggered D flip-flop). Describe the problem when both 
SET and RESET are high. 

FIGURE 3.27 (Top) A standard cell showing the diffusion (n-diffusion and 
p-diffusion), poly, and contact layers (the n-well and p-well are not shown). 
(Bottom) Shows the m1, contact, m2, and via layers. Problem 3.16 traces the circuit 
for this cell. 



If we want an active-high set or reset we can: (1) use an inverter on the set or reset 
signal or (2) we can substitute NOR cells. Since NOR cells are slower than NAND 
cells, which we do depends on whether we want to optimize for speed or area. 

Thus, the largest flip-flop would be one with both Q and QN outputs and active-high set 
and reset, requiring four TX gates, three inverters (four of the seven we normally need 
are replaced with NAND cells), four NAND cells, and two inverters to invert the set 
and reset, making a total of 34 transistors, or 8.5 gates. 
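
The transistor count above can be tallied as follows. The sketch assumes static CMOS cells with two transistors per transmission gate, two per inverter, and four per two-input NAND, and uses four transistors per gate equivalent; the dictionary and function names are ours.

```python
# Transistors per cell, assuming static CMOS implementations.
TRANSISTORS_PER_CELL = {"tx_gate": 2, "inverter": 2, "nand2": 4}
TRANSISTORS_PER_GATE = 4   # one gate equivalent = one two-input NAND


def count_transistors(cells):
    """Total transistor count for a dictionary of {cell type: number of cells}."""
    return sum(TRANSISTORS_PER_CELL[name] * n for name, n in cells.items())


# Four TX gates, three inverters plus two more for set and reset, four NAND cells.
largest_flip_flop = {"tx_gate": 4, "inverter": 3 + 2, "nand2": 4}
total = count_transistors(largest_flip_flop)
gates = total / TRANSISTORS_PER_GATE
print(f"{total} transistors, {gates} gates")
```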

3.18 (Set and reset, 10 min.) Show how to add a synchronous set or a synchronous 
reset to the flip-flop of Figure 2.18(a) using a two-input MUX. 

3.19 (Clocked inverters, 45 min.) Using PSpice compare the delay of an inverter with 
transmission gate with that of a clocked inverter using the G5 process SPICE 
parameters from Table 2.1. 




3.20 (S-R, T, J-K flip-flops, 30 min.) The characteristic equation for a D flip-flop is 
$Q_{t+1} = D$. The characteristic equation for a J-K flip-flop is $Q_{t+1} = J(Q_t)' + K'Q_t$. 

• a. Show how you can build a J-K flip-flop using a D flip-flop. 

• b. The characteristic equation for a T flip-flop (toggle flip-flop) is $Q_{t+1} = (Q_t)'$. 
Show how to build a T flip-flop using a D flip-flop. 

• c. The characteristic equation does not show the timing behavior of a sequential 
element; the characteristic equation for a D latch is the same as that for a D 
flip-flop. The characteristic equation for an S-R latch and an S-R flip-flop is 
$Q_{t+1} = S + R'Q_t$. An S-R flip-flop is sometimes called a pulse-triggered flip-flop. 
Find out the behavior of an S-R latch and an S-R flip-flop and describe the 
differences between these elements and a D latch and a D flip-flop. 

• d. Explain why it is probably not a good idea to use an S-R flip-flop in an ASIC 
design. 
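
The characteristic equations for parts (a) and (b) can be exercised with a small behavioral model in which the D flip-flop is reduced to its next-state function. This is only a sketch of the logic placed in front of the D input, not a timing model, and the function names are ours.

```python
def d_next(d, q):
    """Next state of a D flip-flop: Q(t+1) = D (the present state q is ignored)."""
    return d


def jk_next(j, k, q):
    """J-K flip-flop built from a D flip-flop by driving D = J.Q' + K'.Q."""
    d = (j and (not q)) or ((not k) and q)
    return d_next(d, q)


def t_next(q):
    """T flip-flop built from a D flip-flop by driving D = Q'."""
    return d_next(not q, q)


# Walk through the four J-K input combinations starting from both states.
for q in (False, True):
    for j in (False, True):
        for k in (False, True):
            print(f"Q={int(q)} J={int(j)} K={int(k)} -> Q+={int(jk_next(j, k, q))}")
```

The printed table shows the familiar hold (J = K = 0), reset (K = 1), set (J = 1), and toggle (J = K = 1) behavior.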

3.21 (**Optimum logic, 60 min.) Suppose we have a fixed logic path of length $n_1$. 
We want to know how many (if any) buffer stages we should add at the output of this 
path to optimize the total path delay given the output load capacitance. 

• a. If the total number of stages is $N$ (logic path of length $n_1$ plus $N - n_1$ 
inverters), show that the total path delay is 

$$\hat{D} = N F^{1/N} + \sum_{i=1}^{n_1} (p_i + q_i) + (N - n_1)(p_{inv} + q_{inv}) . \qquad (3.51)$$

The optimum number of stages is given by the solution to the following equation: 

$$\frac{\partial \hat{D}}{\partial N} = \frac{\partial}{\partial N} \left( N F^{1/N} + (N - n_1)(p_{inv} + q_{inv}) \right) = 0 . \qquad (3.52)$$

• b. Show that the solutions to this equation can be written in terms of $F^{1/\hat{N}}$ (the 
optimum stage effort) where $\hat{N}$ is the optimum number of stages: 

$$F^{1/\hat{N}} \left( 1 - \ln F^{1/\hat{N}} \right) + (p_{inv} + q_{inv}) = 0 . \qquad (3.53)$$
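
Equation 3.53 has no closed-form solution for the optimum stage effort, but it is easy to solve numerically. The sketch below uses bisection; the bracketing interval, the tolerance, and the function name are our choices, and the values of p_inv + q_inv in the example are illustrative rather than taken from any particular library.

```python
import math


def optimum_stage_effort(p_plus_q, lo=1.0, hi=50.0, tol=1e-12):
    """Solve rho * (1 - ln(rho)) + (p_inv + q_inv) = 0 for rho by bisection.

    The left-hand side is positive at rho = 1 and decreases for rho > 1,
    so a single root lies in [lo, hi] for any reasonable p_inv + q_inv.
    """
    def f(rho):
        return rho * (1.0 - math.log(rho)) + p_plus_q

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid      # root is to the right of mid
        else:
            hi = mid      # root is at or to the left of mid
    return 0.5 * (lo + hi)


for pq in (0.0, 1.0):
    print(f"p_inv + q_inv = {pq}: optimum stage effort = {optimum_stage_effort(pq):.3f}")
```

With no parasitic or nonideal delay the optimum stage effort is e (about 2.718); as p_inv + q_inv grows, the optimum stage effort grows too, which means fewer, larger stages.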

3.22 (XOR and XNOR cells, 60 min.) Table 3.4 shows the implementations of two- 
and three-input XOR cells in an ASIC standard-cell library (D1 are the 1X drive cells, 
and D2 are the 2X drive versions). Can you explain the choices for the two-input XOR 
cell and complete the table for the three-input XOR cell? 

TABLE 3.4 Implementations of XOR cells (Problem 3.22). 

Cell      Actual implementation                                Alternative implementation(s) 
XOR2D1    AOI21[a1, a2, NOR(a1, a2)]                           not[mux(a1, not(a1), a2)] 
                                                               aoi22(a1, a2, not(a1), not(a2)) 
XOR2D2    NOT[MUX(a1, not(a1), a2)]                            aoi21[a1, a2, nor(a1, a2)] 
                                                               aoi22(a1, a2, not(a1), not(a2)) 
XOR3D1    NOT[MUX[a1, not(a1), not(mux(a3, not(a3), a2))]]     ? 
XOR3D2    NOT[MUX[a1, not(a1), not(mux(a3, not(a3), a2))]]     ? 
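
A truth-table check confirms that the implementations listed in Table 3.4 all compute XOR. The sketch assumes the convention MUX(a, b, s) = a when s = 1 and b when s = 0, which is the convention under which the table entries are consistent; the function names are ours.

```python
from itertools import product


def mux(a, b, s):
    """Two-input multiplexer; selecting a when s is true is an assumed convention."""
    return a if s else b


def nor(a, b):
    return not (a or b)


def aoi21(a, b, c):
    return not ((a and b) or c)


def aoi22(a, b, c, d):
    return not ((a and b) or (c and d))


def xor2_aoi21(a1, a2):
    return aoi21(a1, a2, nor(a1, a2))


def xor2_mux(a1, a2):
    return not mux(a1, not a1, a2)


def xor2_aoi22(a1, a2):
    return aoi22(a1, a2, not a1, not a2)


def xor3_mux(a1, a2, a3):
    return not mux(a1, not a1, not mux(a3, not a3, a2))


BOOLS = [False, True]
xor2_ok = all(
    xor2_aoi21(a, b) == xor2_mux(a, b) == xor2_aoi22(a, b) == (a != b)
    for a, b in product(BOOLS, repeat=2)
)
xor3_ok = all(
    xor3_mux(a, b, c) == ((a != b) != c) for a, b, c in product(BOOLS, repeat=3)
)
print(xor2_ok, xor3_ok)
```

The same helper functions can be used to test candidate alternatives for the three-input cell before comparing them by logical effort.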

3.23 (Library density, 10 min.) Derive an upper limit on cell density as follows: 
Assume a chip consists only of two-input NAND cells with no routing channels 
between rows (often achievable in a 3LM process with over-the-cell routing). 

• a. Explain how many vertical tracks you need to connect to a two-input NAND 
cell, assuming each connection requires a separate track. 

• b. If the NAND cell is 64 λ high with a vertical track width of 8 λ, calculate the 
NAND cell area, carefully explaining any assumptions. 

• c. Calculate the cell density (in gates/mil²) for a 0.35 μm process, λ = 0.175 μm. 

Answer: 3 tracks, 47 μm², 13.7 gates/mil² or 21 × 10³ gates/mm². 
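
The density estimate can be reproduced with a few lines of Python. The sketch assumes three vertical tracks (two inputs and one output), a cell that is exactly three tracks wide, and 1 mil = 25.4 μm; the function name is ours.

```python
MICRONS_PER_MIL = 25.4


def nand2_density(lam_um, height_lam=64, track_lam=8, tracks=3):
    """Return (cell area in um^2, gates per mil^2, gates per mm^2)."""
    width_um = tracks * track_lam * lam_um
    height_um = height_lam * lam_um
    area_um2 = width_um * height_um
    gates_per_mil2 = MICRONS_PER_MIL ** 2 / area_um2
    gates_per_mm2 = 1.0e6 / area_um2
    return area_um2, gates_per_mil2, gates_per_mm2


area, per_mil2, per_mm2 = nand2_density(lam_um=0.175)
print(f"area = {area:.1f} um^2, {per_mil2:.1f} gates/mil^2, {per_mm2:.0f} gates/mm^2")
```

Because this assumes every cell is a minimum-width NAND with no routing overhead, the result is an upper limit rather than a typical density.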

3.24 (Gate-array density, 20 min.) The LSI Logic 10k and 100k gate arrays use a 
four-transistor base cell, equivalent to 1 gate, that is 12 tracks high and 3 tracks wide. 

• a. If a metal track is 8 λ, where λ = 0.75 μm for a 1.5 μm technology, calculate 
the area of the LSI Logic base cell $A_L$ in mil². 

• b. If we could use every base cell in the gate array, the cell density would be 
$D_G = 1/A_L$. Assume that, because of routing area and inefficiency of the gate 
array, we can use only 50 percent of the base cells for logic. What is $D_G$ for the 
LSI Logic 1.5 μm array? 

• c. Chip cell density $D_G$ is about 1.0 gate/mil² for a 1 μm technology (a 
two-input NAND cell occupies an area 25 μm on a side in a technology whose 
transistors are 1 μm long). This can change by a factor of 2 or more for a 
gate-array/standard-cell ASIC or high-density/high-performance library. 
Assume that cell density $D_G$ scales ideally with technology. If the minimum 
feature size of a technology is 2 λ, then $D_G \propto 1/\lambda^2$. Thus, for example, a 1.5 μm 
technology should have a cell density of roughly $(1/1.5)^2$ gates/mil². How does 
this agree with your estimate for the LSI Logic array? 

3.25 (SiArc RAM, 10 min.) Suppose we need 16 k-bit of SRAM and 20 k-gate of 
random logic on a channelless gate array. Assume a base cell with four transistors and 
that we can build a RAM cell using two of these base cells. The RAM bits will require 
32k base cells and the random logic will require 20k base cells. Suppose the base cell 
area is 12 tracks high, 3 tracks wide, and the horizontal and vertical track spacing is 
equal at 8 λ. 

• a. Calculate the total area of the base cells we need. Now suppose we redesign 
the gate-array base cell so that we can build a RAM bit cell using a single base 
cell that is 20 tracks high, 3 tracks wide, and has 4 logic cell transistors and 4 
RAM cell transistors. Assume that since the base cell now contains 8 transistors 
we only need 12 k base cells to implement 20 k-gate of random logic (the new 
base cell is less efficient than the old cell for implementing random logic). 

• b. Calculate the base cell area using the new base cell design. 

• c. Comment. 

Answer: 1.2 × 10⁸ λ², 1.1 × 10⁸ λ². 
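
The two areas can be compared with the short sketch below. It takes k to mean 1000 (using 1024 changes the totals by a few percent but not the conclusion); the function and variable names are ours.

```python
TRACK_LAMBDA = 8    # horizontal and vertical track spacing in lambda
K = 1000            # we take "k" to mean 1000; this is an assumption


def array_area(n_cells, tracks_high, tracks_wide):
    """Total base-cell area in lambda^2 for n_cells identical base cells."""
    cell_area = (tracks_high * TRACK_LAMBDA) * (tracks_wide * TRACK_LAMBDA)
    return n_cells * cell_area


# Original base cell: 32k cells for RAM plus 20k cells for random logic.
old_area = array_area(32 * K + 20 * K, tracks_high=12, tracks_wide=3)

# Redesigned base cell: 16k cells for RAM plus 12k cells for random logic.
new_area = array_area(16 * K + 12 * K, tracks_high=20, tracks_wide=3)

print(f"old: {old_area:.2e} lambda^2, new: {new_area:.2e} lambda^2, "
      f"ratio: {new_area / old_area:.2f}")
```

For this mix of RAM and logic the taller base cell saves roughly 10 percent of the area; the saving grows as the proportion of RAM increases.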

3.26 (***Gate-array base cell, 60 min.) Figure 3.28 shows a simple gate-array base 
cell. Use the design rules shown in Table 2.16 (Problem 2.33) to calculate the 
minimum size of this base cell. Do this by determining which design rules apply to the 
labels shown adjacent to each space or width in the figure. In most cases each of the 
spaces is determined by a single rule related to the region labeled, for example, the 
contact width labeled 'cc' is 2 λ, determined by rule C.1, the exact contact size. There is 
one exception, shown in the figure. Space 'aa' (bounding box, BB, to edge of pdiff) and 
width 'bb' (edge of pdiff to edge of contact) are determined by the minimum space 
labeled 'xx' (bounding box, BB, to poly edge) and width 'yy' (edge of poly to edge of 
contact). Space 'xx' is one half of the poly to poly spacing over field (rule P.4) because 
two base cells abut as shown in the figure. Width 'yy' is equal to the minimum poly 
overlap of contact (rule C.3). The distance 'aa + bb' is thus determined by the minimum 
distance 'xx + yy', as shown. The other distances are more straightforward to 
determine. 

Answer: 40 λ high by 26.25 λ wide. 




FIGURE 3.28 A simple gate-array base cell (Problem 3.26 ). 



3.27 (CIF, 15 min.) Here is the part of the CIF for a standard cell that describes the 
n-well (CWN) and p-well (CWP) structure. The statement B length height xCenter, 
yCenter is CIF for a box (CIF dimensions are in centimicrons, 0.01 μm): 

DS 1; LCWN; B6000 1560 13600,3660; B2480 60 11840,2850; B2320 60 15440,2850; 
LCWP; B680 60 13740,2730; B6000 1380 13600,2010; 

• a. Draw the wells and BB. Label the dimensions in microns and λ (λ = 0.4 μm). 

• b. This is a double-entry cell with m2 connectors at top and bottom. For this cell 
library the cell AB is 3 λ (120 centimicrons, determined by the well rules) inside 
the cell BB on all sides. What is the size of the cell AB in microns and λ? 

• c. The vertical (m2) routing pitch (the distance between centers of adjacent 
vertical m2 interconnect lines) is equal to the vertical track spacing and is 8 λ 
(320 centimicrons). How many vertical tracks are there in this cell? 
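
The geometry in this problem can be worked out from the box statements directly. The sketch below hard-codes the five boxes from the CIF above rather than parsing the text, takes the BB as the smallest rectangle enclosing all of the well boxes, and counts tracks as the AB width divided by the track pitch. That last step is one reasonable interpretation, not the only one, and all names are ours.

```python
LAMBDA_CM = 40      # lambda = 0.4 um = 40 centimicrons
AB_INSET_CM = 120   # the AB is 3 lambda inside the BB on all sides
PITCH_CM = 320      # vertical track spacing of 8 lambda

# (length, height, x_center, y_center) in centimicrons, copied from the CIF.
WELL_BOXES = [
    (6000, 1560, 13600, 3660),   # n-well (CWN)
    (2480, 60, 11840, 2850),     # n-well (CWN)
    (2320, 60, 15440, 2850),     # n-well (CWN)
    (680, 60, 13740, 2730),      # p-well (CWP)
    (6000, 1380, 13600, 2010),   # p-well (CWP)
]


def bounding_box(boxes):
    """Smallest rectangle (x_min, y_min, x_max, y_max) enclosing all boxes."""
    x_min = min(x - length // 2 for length, _, x, _ in boxes)
    x_max = max(x + length // 2 for length, _, x, _ in boxes)
    y_min = min(y - height // 2 for _, height, _, y in boxes)
    y_max = max(y + height // 2 for _, height, _, y in boxes)
    return x_min, y_min, x_max, y_max


bb = bounding_box(WELL_BOXES)
bb_width, bb_height = bb[2] - bb[0], bb[3] - bb[1]
ab_width = bb_width - 2 * AB_INSET_CM
ab_height = bb_height - 2 * AB_INSET_CM
tracks = ab_width // PITCH_CM

print(f"BB: {bb_width / 100} x {bb_height / 100} um "
      f"({bb_width // LAMBDA_CM} x {bb_height // LAMBDA_CM} lambda)")
print(f"AB: {ab_width / 100} x {ab_height / 100} um "
      f"({ab_width // LAMBDA_CM} x {ab_height // LAMBDA_CM} lambda)")
print(f"vertical tracks: {tracks}")
```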



3.28 (CIF, 60 min.) Figure 3.29 shows an example of CIF that describes a single 
rectangle (box) of m1 with an accompanying label. 

(CIF written by the Tanner Research layout editor: L-Edit); 
(TECHNOLOGY: VLSIcmn6); 
(DATE: Thu, Jun 27, 1996); 
(FABCELL: NONE); 
(SCALING: 1 CIF Unit = 1/120 Lambda, 1 Lambda = 3/10 Microns); 
DS1 2 8; 
9 Cell0; 
94 LabelText 60 180 CM; 
LCM; 
B 240 120 120 300; 
DF; 
E 







FIGURE 3.29 A simple CIF example (Problem 3.28 ). 



The CIF code has the following meaning: 

• Lines 1–5 are CIF comments. 

• Line 6 is a definition start for symbol 1 and marks the beginning of a symbol 
definition (a symbol is a piece of layout, symbol numbers are unique identifiers). 
The integers 2 and 8 define a scaling factor 2/8 (= 0.25) to be applied to distance 
measurements (the CIF unit, after scaling, is a centimicron or 0.01 μm). 

• Line 7 is a user extension or expansion (all extensions begin with a digit). L-Edit 
uses user extension 9 for cell names (Cell0 in this case). 

• Line 8 is a user extension for a cell label located on layer CM (first-level metal 
in this technology) located at x = 60 units, y = 180 units (60, 180). Applying the 
scaling factor of 0.25, this translates to (15, 45) in centimicrons or (0.5, 1.5) in 
lambda. 

• Line 9 is a layer specification or command (begins with L ). 

• Line 10 is a box command and describes a box with (in order) length, L , of 240 
units; width, W , of 120 units; and center at x = 120 units and y = 300 units. 



